Friday, July 05, 2013

Bayesian Network Inference with R and bnlearn


The Web Intelligence and Big Data course at Coursera had a section on Bayesian Networks. The associated programming assignment was to answer a couple of questions about a fairly well-known (in retrospect) Bayesian network called "asia" or "chest clinic". The approach illustrated in the course was to use SQL, which worked great, but I wanted to see if I could also do it using Python or R.

I did look at Python first, but I was looking for a mature, well-documented package and couldn't find one (suggestions welcome). In the R world, I found gRain and bnlearn. gRain looked promising initially, although the installation was somewhat dodgy. However, I found later that the version I had wouldn't let me apply evidence, i.e., setFinding() or setEvidence() seemed to have no effect, so I ditched it in favor of bnlearn.
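
For reference, this is roughly how the packages get installed; a sketch from memory rather than a verbatim log. bnlearn comes straight from CRAN, while gRain pulls in Bioconductor dependencies (graph, RBGL, Rgraphviz) via gRbase, which may be where the dodginess came from.

install.packages("bnlearn")
# gRain's graph dependencies live on Bioconductor (2013-era installer shown)
source("http://bioconductor.org/biocLite.R")
biocLite(c("graph", "RBGL", "Rgraphviz"))
install.packages("gRain")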

The network is shown below. Each node in the network corresponds to a particular event and has probabilities associated with it. This is an example of a Bayesian Network that has been built from probabilities assigned by domain experts. You can read more about the asia network and Bayesian networks in general here.


The code below defines this network to bnlearn, and then applies constraints (or evidence) to the network to get the probabilities of the three diseases Tuberculosis (T), Lung Cancer (L) and Bronchitis (B).

library(bnlearn)

set.seed(42)

# a BN using expert knowledge
net <- model2network("[A][S][T|A][L|S][B|S][E|T:L][X|E][D|B:E]")
yn <- c("yes", "no")
cptA <- matrix(c(0.01, 0.99), ncol=2, dimnames=list(NULL, yn))
cptS <- matrix(c(0.5, 0.5), ncol=2, dimnames=list(NULL, yn))
cptT <- matrix(c(0.05, 0.95, 0.01, 0.99), 
               ncol=2, dimnames=list("T"=yn, "A"=yn))
cptL <- matrix(c(0.1, 0.9, 0.01, 0.99), 
               ncol=2, dimnames=list("L"=yn, "S"=yn))
cptB <- matrix(c(0.6, 0.4, 0.3, 0.7), 
               ncol=2, dimnames=list("B"=yn, "S"=yn))
# cptE and cptD are 3-dimensional; R matrices are strictly 2-D, so we
# build these as arrays by setting dim and dimnames manually.
cptE <- c(1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0)
dim(cptE) <- c(2, 2, 2)
dimnames(cptE) <- list("E"=yn, "L"=yn, "T"=yn)
cptX <- matrix(c(0.98, 0.02, 0.05, 0.95), 
               ncol=2, dimnames=list("X"=yn, "E"=yn))
cptD <- c(0.9, 0.1, 0.7, 0.3, 0.8, 0.2, 0.1, 0.9)
dim(cptD) <- c(2, 2, 2)
dimnames(cptD) <- list("D"=yn, "E"=yn, "B"=yn)
net.disc <- custom.fit(net, dist=list(A=cptA, S=cptS, T=cptT, L=cptL, 
                                      B=cptB, E=cptE, X=cptX, D=cptD))

# Unit test: given no evidence, the chance of tuberculosis is about 1%
cpquery(net.disc, (T=="yes"), TRUE)
cpquery(net.disc, (L=="yes"), TRUE)
cpquery(net.disc, (B=="yes"), TRUE)
# [1] 0.01084444
# [1] 0.05428889
# [1] 0.4501667

# Question 1:
# Patient has recently visited Asia and does not smoke. Which is most
# likely?
# (a) the patient is more likely to have tuberculosis than anything else.
# (b) the chance that the patient has lung cancer is higher than he/she 
#     having tuberculosis
# (c) the patient is more likely to have bronchitis than anything else
# (d) the chance that the patient has tuberculosis is higher than he/she 
#     having bronchitis
cpquery(net.disc, (T=="yes"), (A=="yes" & S=="no"))
cpquery(net.disc, (L=="yes"), (A=="yes" & S=="no"))
cpquery(net.disc, (B=="yes"), (A=="yes" & S=="no"))
# [1] 0.04988124
# [1] 0.00462963
# [1] 0.316092
# shows that (c) is correct.

# Question 2
# The patient has recently visited Asia, does not smoke, is not 
# complaining of dyspnoea, but his/her x-ray shows a positive shadow
# (a) the patient most likely has tuberculosis, but lung cancer is 
#     almost equally likely
# (b) the patient most likely has tuberculosis as compared to any of 
#     the other choices
# (c) the patient most likely has bronchitis, and tuberculosis is 
#     almost equally likely
# (d) the patient most likely has tuberculosis, but bronchitis is 
#     almost equally likely
cpquery(net.disc, (T=="yes"), (A=="yes" & S=="no" & D=="no" & X=="yes"))
cpquery(net.disc, (L=="yes"), (A=="yes" & S=="no" & D=="no" & X=="yes"))
cpquery(net.disc, (B=="yes"), (A=="yes" & S=="no" & D=="no" & X=="yes"))
# [1] 0.2307692
# [1] 0.04166667
# [1] 0.2105263
# shows that (d) is correct
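
As an aside, the figure above was drawn by hand, but bnlearn can also render the structure itself. A rough sketch (graphviz.plot() needs the Rgraphviz package from Bioconductor, and bn.fit.barchart() plots the conditional probabilities of a single node):

graphviz.plot(net)           # draw the DAG structure
bn.fit.barchart(net.disc$B)  # barchart of P(B | S) from the fitted network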

We then try to build the network from data, which is probably going to be the more common case. The bnlearn package contains the "asia" dataset, which we load as follows, then build a network from it and ask the same questions. It turns out that the answers are the same as those from the network built by domain experts.

# same BN using data
data(asia)
head(asia)
#    A   S   T  L   B   E   X   D
# 1 no yes  no no yes  no  no yes
# 2 no yes  no no  no  no  no  no
# 3 no  no yes no  no yes yes yes
# 4 no  no  no no yes  no  no yes
# 5 no  no  no no  no  no  no yes
# 6 no yes  no no  no  no  no yes

net.data <- bn.fit(hc(asia), asia)

# unit test
cpquery(net.data, (T=="yes"), TRUE)
cpquery(net.data, (L=="yes"), TRUE)
cpquery(net.data, (B=="yes"), TRUE)
# [1] 0.008564706
# [1] 0.066
# [1] 0.5077882

# question 1
cpquery(net.data, (T=="yes"), (A=="yes" & S=="no"))
cpquery(net.data, (L=="yes"), (A=="yes" & S=="no"))
cpquery(net.data, (B=="yes"), (A=="yes" & S=="no"))
# [1] 0.01630435
# [1] 0.02752294
# [1] 0.2978723
# still shows (c) is correct.

cpquery(net.data, (T=="yes"), (A=="yes" & S=="no" & D=="no" & X=="yes"))
cpquery(net.data, (L=="yes"), (A=="yes" & S=="no" & D=="no" & X=="yes"))
cpquery(net.data, (B=="yes"), (A=="yes" & S=="no" & D=="no" & X=="yes"))
# [1] 0.1
# [1] 0
# [1] 0.1666667
# still shows (d) is correct.
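
As a quick sanity check that the learned structure resembles the hand-specified one, bnlearn's compare() counts the arcs the two networks do and don't share. A small sketch (untested here; note that hill-climbing learns structure only up to an equivalence class, so some arc directions may differ):

compare(target=net, current=hc(asia))
# the learned model string can also be inspected directly
modelstring(hc(asia))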

One thing I noticed is that if you make the same cpquery call twice in a row, you get different answers. From what I understand about Bayesian Networks, I don't think this should happen. I am not very familiar with the R ecosystem and couldn't find the appropriate R mailing list to ask about this; very likely I just don't know where to look (if you do, please let me know).
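
Until I understand this better, two workarounds that should make the numbers more stable (a sketch, not something I have verified carefully): pass a larger n to cpquery() so each call draws more samples, or average over many repeated calls.

# draw more samples per call (n is passed to the logic sampling method)
cpquery(net.disc, (L=="yes"), (A=="yes" & S=="no"), n=10^6)
# or average the results of repeated calls
mean(replicate(100, cpquery(net.disc, (L=="yes"), (A=="yes" & S=="no"))))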

In case you need it, all the code in this post, as well as the SQL+Python based one I wrote for the original assignment, is available on my github page for this project.

15 comments (moderated to prevent spam):

Christel said...

Cool!

Javier said...

Nice post! I'm using bnlearn in a current project and I find it quite good. However, I'm not very satisfied with its inference engine. It uses sampling methods, so two consecutive calls to cpquery with the same arguments will likely return slightly different results. This allows for arbitrarily complex queries, but it is not the best method for simple cases. In your example you can see that the probability of L=="yes" given A=="yes" and S=="no", which should always be 0.01, comes out as 0.0046. Unfortunately, bnlearn does not seem to include any exact (and more restricted) inference algorithm.

Sujit Pal said...

Thanks Javi, both for your appreciation and for the clarification about bnlearn's sampling approach being the reason for variable results - I initially thought that perhaps I was doing something wrong :-).

Anonymous said...

Sujit,
Can you please post an R bnlearn example where a table of continuous values are read in, and a network is generated. Also, if possible, can you post an example of reading from a table of mixed discrete and continuous values. Thank you.
Jim Metz James.Metz@AbbVie.com

Sujit Pal said...

Hi Jim, this was my only attempt at using bnlearn :-). Not sure if bnlearn has a mailing list, you may get better responses there. I did find this discussion on Cross-Validated that addresses (part of) your question. I was just learning about Bayes Nets when I wrote the post and the only toolkit I found at the time was bnlearn on R. Since then I've come across libpgm on Python (although not experimented with it), which would be preferable for me (more comfortable with Python than R) if it has a similar feature set.

Unknown said...

Sujit, great post. Probably a stupid question, but how did you generate the plot combining the structure and the parameters? The plot function applied to bnlearn objects only gives me the structure.

many thanks

Sujit Pal said...

Thanks Ricardo. I think you will be disappointed with my answer, sorry about that, but the plot was generated by hand using dia :-).

Unknown said...

Sujit, no problem at all. I actually found another post where this was done: http://d.hatena.ne.jp/isseing333/20100618/1276849146

thanks

Sujit Pal said...

Cool, thanks for the link.

Jason said...

Hi guys, I found the best way to overcome the sampling-based inference in bnlearn is to run a simulation with a large number of trials. By the law of large numbers, the average should converge to its theoretical probability.

cpquery(net.disc, (L=="yes"), (A=="yes" & S=="no")) will then be very close to 0.01

The way I do it:

prob1 <- prob2 <- prob3 <- numeric(2000)
for (i in 1:2000) {
  prob1[i] <- cpquery(net.disc, (B=="yes"), (A=="yes" & S=="no"))
  prob2[i] <- cpquery(net.disc, (T=="yes"), (A=="yes" & S=="no"))
  prob3[i] <- cpquery(net.disc, (L=="yes"), (A=="yes" & S=="no"))
}

list(Probabilities=c(mean(prob1), mean(prob2), mean(prob3)),
     Stdev=c(sd(prob1), sd(prob2), sd(prob3)))

Sujit Pal said...

Thanks Jason.

Unknown said...

I'm not sure why gRain isn't working for you. I have had no trouble setting findings and doing inference using gRain. However, gRain doesn't have structure learning. So I have used bnlearn for structure learning, and then used gRain for exact inference. You can transform between grain and bn.fit objects using the as.grain(x) and as.bn.fit(x) functions in the bnlearn package. These only work for discrete networks.

Sujit Pal said...

Hi Kathryn, I haven't tried it recently, so it is possible that this has been fixed since I tried it. Also entirely possible that I might have been doing something stupid. Thanks for the tips about transforming between gRain and bnlearn, I didn't know this.
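
For reference, here is a minimal sketch of what that workflow might look like (untested on my end, and assuming gRain installs cleanly):

library(gRain)
# convert the fitted bnlearn network to a gRain object and compile it
junction <- compile(as.grain(net.disc))
# apply the evidence from question 1 and query the disease nodes exactly
ev <- setEvidence(junction, nodes=c("A", "S"), states=c("yes", "no"))
querygrain(ev, nodes=c("T", "L", "B"))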

Anonymous said...

To get repeatable results, you might be able to try:
set.seed(1)
before each call to cpquery(); it works with cpdist().

Sujit Pal said...

Thank you, I will remember that the next time I work with bnlearn. This would mean that the random number generator state changes with each cpquery() call, since I had set.seed() at the top of my code and still saw varying results.