DataTechNotes: Bayesian Network in R

A Bayesian Network (BN) is a probabilistic model based on directed a cyclic graphs that describe a set of variables and their conditional dependencies to each other. It is a graphical model, and we can easily check the conditional dependencies of the variables and their directions in a graph.
In this post, I we'll briefly learn how to use Bayesian networks with the bnlearn package in R. You may find more information about the package please refer to this page.

We'll start by loading the bnlearn package.

> library(bnlearn)

The bnlearn package implements the following learning algorithms.

Constrain-based: PC, GS, IAMB, MMPC, Hilton-PC
Score-based: Hill-Climbing, Tabu Search
Pairwise: ARACNE, Chow-Liu
Hybrid: MMHC, RSMAX2

We use score based learning algorithm, Hill-Climbing with hc() function. First, we'll generate simple dataset for this tutorial.

> print(simd)

     state element type  color accept
1   Target    long    A    red    yes
2   Target    long    A    red    yes
3   Target  normal    C yellow    yes
4   Target  normal    B  green    yes
5  Outlier   short    A    red     no
6   Target    long    B  green    yes
7  Outlier   short    A    red     no
8  Outlier    long    C yellow     no
9   Target    long    C yellow    yes
10 Outlier  normal    A    red     no

In this dataset, the 'state' has relationships with 'element' and 'accept' columns. And 'type' with 'color' column. When you create a data frame with categorical data, column should be a factor type. Otherwise, the data frame cannot be used in BN structure creation.

> str(simd)

'data.frame': 10 obs. of  5 variables:
 $ state  : Factor w/ 2 levels "Outlier","Target": 2 2 2 2 1 2 1 1 2 1
 $ element: Factor w/ 3 levels "long","normal",..: 1 1 2 2 3 1 3 1 1 2
 $ type   : Factor w/ 3 levels "A","B","C": 1 1 3 2 1 2 1 3 3 1
 $ color  : Factor w/ 3 levels "green","red",..: 2 2 3 1 2 1 2 3 3 2
 $ accept : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 1 2 1

Next, we'll create learning structure with hc() function.

> hc_simd <- hc(simd)

> class(hc_simd)

[1] "bn"

> hc_simd

  Bayesian network learned via Score-based methods

  model:
   [state][type][element|state][color|type][accept|state] 
  nodes:                                 5 
  arcs:                                  3 
    undirected arcs:                     0 
    directed arcs:                       3 
  average markov blanket size:           1.20 
  average neighbourhood size:            1.20 
  average branching factor:              0.60 

  learning algorithm:                    Hill-Climbing 
  score:                                 BIC (disc.) 
  penalization coefficient:              1.151292546 
  tests used in the learning procedure:  22 
  optimized:                             TRUE

We can see the structure in a plot.

> plot(hc_simd)

In this plot, state, element, accept, type, and color are called nodes. Directions between nodes are described in arcs that is a matrix containing a data of from-to elements direction.

> hc_simd$arcs

      from    to       
[1,] "type"  "color"  
[2,] "state" "accept" 
[3,] "state" "element"

As above arcs show that 'type' to 'color', and 'state' to 'accept' and 'element' relationships exist in our data. 'type' and 'state' are two independent groups that are not dependent on each other.

Next, we'll fit the model with data.

> simd_fitted <- bn.fit(hc_simd, data=simd)
> simd_fitted

  Bayesian network parameters

  Parameters of node state (multinomial distribution)

Conditional probability table:
 Outlier  Target 
    0.4     0.6 

  Parameters of node element (multinomial distribution)

Conditional probability table:
 
        state
element       Outlier       Target
  long   0.2500000000 0.6666666667
  normal 0.2500000000 0.3333333333
  short  0.5000000000 0.0000000000

  Parameters of node type (multinomial distribution)

Conditional probability table:
   A   B   C 
0.5 0.2 0.3 

  Parameters of node color (multinomial distribution)

Conditional probability table:
 
        type
color    A B C
  green  0 1 0
  red    1 0 0
  yellow 0 0 1

  Parameters of node accept (multinomial distribution)

Conditional probability table:
 
      state
accept Outlier Target
   no        1      0
   yes       0      1

Based on the above train data, we can perform conditional probability queries.

Let's say element="short", and accept="no", and in this condition we check state probability of "Outlier" and "Target". To query this condition we use cpquery() function.

> cpquery(simd_fitted,

           event=((element=="short")&(accept=="no")),

           evidence=(state=="Outlier"))
[1] 0.4856262834

state's probability of becoming "Outlier" is 48%.

> cpquery(simd_fitted,

            event=((element=="short")&(accept=="no")),

            evidence=(state=="Target"))
[1] 0

state's probability of becoming "Target" is 0%.

In this post, we've briefly learned how to apply Bayesian networks with bnlearn package in R.

DataTechNotes

Pages

Bayesian Network in R

No comments:

Post a Comment