Examples for basic BN learning
Posted on Thu 07 July 2016 in score_based
I’ll soon finish basic score-based structure estimation for BayesianModels. Below is the current state of my PR, with two examples.
Changes in pgmpy/estimators/
I rearranged the estimator classes to inherit from each other like this:
```
                       MaximumLikelihoodEstimator
                      /
     ParameterEstimator -- BayesianEstimator
    /
BaseEstimator            ExhaustiveSearch
 |  \                   /
 |   StructureEstimator -- HillClimbSearch
 |                      \
 |                       ConstraintBasedEstimator
 |
 |
 |               BayesianScore
 |              /
 StructureScore -- BicScore
```
BaseEstimator takes a data set and optionally state_names and a flag for how to handle missing values. ParameterEstimator and its subclasses additionally take a model. All *Search classes are initialized with a StructureScore instance (BayesianScore by default) in addition to the data set.
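To make the layout concrete, here is a plain-Python sketch of the hierarchy using stub classes. These are placeholders, not the actual pgmpy implementations; in particular, the name of the missing-value flag is made up for illustration.

```python
# Stub classes mirroring the estimator hierarchy (placeholders, not pgmpy code).
class BaseEstimator:
    def __init__(self, data, state_names=None, complete_samples_only=True):
        # `complete_samples_only` stands in for the missing-value flag
        self.data = data
        self.state_names = state_names

class ParameterEstimator(BaseEstimator):
    def __init__(self, model, data, **kwargs):
        super().__init__(data, **kwargs)
        self.model = model  # parameter estimators additionally take a model

class MaximumLikelihoodEstimator(ParameterEstimator): pass
class BayesianEstimator(ParameterEstimator): pass

class StructureScore(BaseEstimator): pass
class BayesianScore(StructureScore): pass
class BicScore(StructureScore): pass

class StructureEstimator(BaseEstimator):
    def __init__(self, data, scoring_method=None, **kwargs):
        super().__init__(data, **kwargs)
        # *Search classes take a StructureScore instance, BayesianScore by default
        self.scoring_method = scoring_method or BayesianScore(data)

class ExhaustiveSearch(StructureEstimator): pass
class HillClimbSearch(StructureEstimator): pass
class ConstraintBasedEstimator(StructureEstimator): pass

print(isinstance(ExhaustiveSearch([]).scoring_method, BayesianScore))  # True
```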
Example
Given a data set with 5 or fewer variables, we can search through all possible BayesianModels and find the best-scoring one using ExhaustiveSearch (currently 5 variables already take a few minutes, but this can be made faster):
```python
import pandas as pd
import numpy as np
from pgmpy.estimators import ExhaustiveSearch

# create a random data sample with 3 variables, where B and C are identical:
data = pd.DataFrame(np.random.randint(0, 5, size=(5000, 2)), columns=list('AB'))
data['C'] = data['B']

est = ExhaustiveSearch(data)
best_model = est.estimate()
print(best_model.nodes())
print(best_model.edges())

print('\nall scores:')
for score, model in est.all_scores():
    print(score, model.edges())
```
The example first prints nodes and edges of the best-fitting model and then the scores for all possible BayesianModels for this data set:
```
['A', 'B', 'C']
[('B', 'C')]

all scores:
-24243.15030635083 [('A', 'C'), ('A', 'B')]
-24243.149854387288 [('A', 'B'), ('C', 'A')]
-24243.149854387288 [('A', 'C'), ('B', 'A')]
-24211.96205284525 [('A', 'B')]
-24211.96205284525 [('A', 'C')]
-24211.961600881707 [('B', 'A')]
-24211.961600881707 [('C', 'A')]
-24211.961600881707 [('C', 'A'), ('B', 'A')]
-24180.77379933967 []
-16603.134367431743 [('A', 'C'), ('A', 'B'), ('B', 'C')]
-16603.13436743174 [('A', 'C'), ('A', 'B'), ('C', 'B')]
-16603.133915468195 [('A', 'B'), ('C', 'A'), ('C', 'B')]
-16603.133915468195 [('A', 'C'), ('B', 'A'), ('B', 'C')]
-16571.946113926162 [('A', 'C'), ('B', 'C')]
-16571.94611392616 [('A', 'B'), ('C', 'B')]
-16274.052597732147 [('A', 'B'), ('B', 'C')]
-16274.052597732145 [('A', 'C'), ('C', 'B')]
-16274.0521457686 [('B', 'A'), ('B', 'C')]
-16274.0521457686 [('C', 'A'), ('B', 'C')]
-16274.0521457686 [('C', 'B'), ('B', 'A')]
-16274.0521457686 [('C', 'A'), ('C', 'B')]
-16274.0521457686 [('C', 'A'), ('B', 'A'), ('B', 'C')]
-16274.0521457686 [('C', 'A'), ('C', 'B'), ('B', 'A')]
-16242.864344226566 [('B', 'C')]
-16242.864344226564 [('C', 'B')]
```
There is a big jump in score between the models in which B and C influence each other (around -16243 to -16603) and the rest (around -24181 to -24243), as expected, since the two variables are perfectly correlated.
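The size of the jump can be explained by the likelihood part of the score alone: since C is an exact copy of B, leaving C without a parent costs roughly 5000 * ln(1/5), about -8047, in log-likelihood, while C given B is deterministic and costs nothing; that matches the observed gap of roughly 8000. A quick sketch with plain pandas/numpy (no pgmpy involved; loglik is my own helper here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.integers(0, 5, size=(5000, 2)), columns=list('AB'))
data['C'] = data['B']  # C is an exact copy of B

def loglik(child, parents=()):
    """Maximized log-likelihood contribution of `child` given `parents`."""
    if parents:
        counts = data.groupby(list(parents))[child].value_counts()
        totals = counts.groupby(level=list(range(len(parents)))).transform('sum')
    else:
        counts = data[child].value_counts()
        totals = len(data)
    return float((counts * np.log(counts / totals)).sum())

print(loglik('C'))         # about 5000 * ln(1/5), i.e. roughly -8047
print(loglik('C', ['B']))  # 0.0: C is fully determined by B
```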
Example 2
I tried the same with the Kaggle titanic data set:
```python
import pandas as pd
from pgmpy.estimators import ExhaustiveSearch

# data link: "https://www.kaggle.com/c/titanic/download/train.csv"
titanic_data = pd.read_csv('testdata/titanic_train.csv')
titanic_data = titanic_data[["Survived", "Sex", "Pclass"]]

est = ExhaustiveSearch(titanic_data)
for score, model in est.all_scores():
    print(score, model.edges())
```
Output:
```
-2072.9132364404695 []
-2069.071694164769 [('Pclass', 'Sex')]
-2069.0144197068785 [('Sex', 'Pclass')]
-2025.869489762676 [('Survived', 'Pclass')]
-2025.8559302273054 [('Pclass', 'Survived')]
-2022.0279474869753 [('Survived', 'Pclass'), ('Pclass', 'Sex')]
-2022.0143879516047 [('Pclass', 'Survived'), ('Pclass', 'Sex')]
-2021.9571134937144 [('Sex', 'Pclass'), ('Pclass', 'Survived')]
-2017.5258065853768 [('Survived', 'Pclass'), ('Sex', 'Pclass')]
-1941.3075053892835 [('Survived', 'Sex')]
-1941.2720031713893 [('Sex', 'Survived')]
-1937.4304608956886 [('Sex', 'Survived'), ('Pclass', 'Sex')]
-1937.4086886556925 [('Survived', 'Sex'), ('Sex', 'Pclass')]
-1937.3731864377983 [('Sex', 'Survived'), ('Sex', 'Pclass')]
-1934.134485060888 [('Survived', 'Sex'), ('Pclass', 'Sex')]
-1894.2637587114903 [('Survived', 'Sex'), ('Survived', 'Pclass')]
-1894.2501991761196 [('Survived', 'Sex'), ('Pclass', 'Survived')]
-1894.228256493596 [('Survived', 'Pclass'), ('Sex', 'Survived')]
-1891.0630673606006 [('Sex', 'Survived'), ('Pclass', 'Survived')]
-1887.2215250849 [('Sex', 'Survived'), ('Pclass', 'Survived'), ('Pclass', 'Sex')]
-1887.1642506270096 [('Sex', 'Survived'), ('Sex', 'Pclass'), ('Pclass', 'Survived')]
-1887.0907383830947 [('Survived', 'Sex'), ('Survived', 'Pclass'), ('Pclass', 'Sex')]
-1887.077178847724 [('Survived', 'Sex'), ('Pclass', 'Survived'), ('Pclass', 'Sex')]
-1885.9200755341908 [('Survived', 'Sex'), ('Survived', 'Pclass'), ('Sex', 'Pclass')]
-1885.8845733162966 [('Survived', 'Pclass'), ('Sex', 'Survived'), ('Sex', 'Pclass')]
```
Here it didn’t work out as I had hoped: [('Sex', 'Survived'), ('Pclass', 'Survived')] has the best score among models with two or fewer edges, but every model with three edges scores better. I haven’t had a closer look at the data set yet, but a weak dependency between Sex and Pclass would explain this.
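One quick way to check for that dependency is a log-likelihood-ratio (G) test of independence on the Sex/Pclass contingency table. Below is a sketch with plain numpy/pandas; the counts are illustrative stand-ins, and the real table would come from pd.crosstab on the training data:

```python
import numpy as np
import pandas as pd

# Illustrative Sex x Pclass counts; on the real data use
# pd.crosstab(titanic_data['Sex'], titanic_data['Pclass']) instead.
observed = pd.DataFrame([[94, 76, 144], [122, 108, 347]],
                        index=['female', 'male'], columns=[1, 2, 3])

O = observed.to_numpy()
# expected counts if Sex and Pclass were independent
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()

# G-statistic; under independence it is chi-squared distributed
# with (2 - 1) * (3 - 1) = 2 degrees of freedom
G = 2 * (O * np.log(O / E)).sum()
print(round(G, 2))
```

A G value well above the chi-squared critical value for 2 degrees of freedom (about 5.99 at the 5% level) would indicate that Sex and Pclass are not independent.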