NAIVE BAYES ALGORITHM IN SQL, PL/SQL, SPARK SQL


##############################################
#cd /c/temp/;spark-submit.cmd pysparknaivebayes.py

"""

NAIVE BAYES ALGORITHM IN SQL, PL/SQL, SPARK SQL
Written: JP Vijaykumar
Date: Mar 8th 2022
********************************************************************************
This script is provided for educational purposes ONLY.

Readers' discretion and ingenuity are advised in making any changes/modifications these
scripts may require for use in their respective environments.
The user assumes full responsibility while executing these scripts.

The sql scripts were tested in 11g & 19c databases.
Pls run these scripts in a dev/test db only and exercise caution.

During testing, the scripts failed many times due to incompatible libraries/dependent softwares in spark/python.
It takes patience and time to debug such issues and install the dependent libraries'/softwares' versions that make the spark/python code work.

One learns a lot, while debugging the code and resolving issues.
********************************************************************************
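For reference, the quantity all three implementations below compute is the standard Naive Bayes posterior: the class prior times the per-feature conditional probabilities, normalized over both classes (the features are assumed conditionally independent given the class):

```latex
P(c \mid x_1,\ldots,x_4) \;=\;
\frac{P(c)\,\prod_{i=1}^{4} P(x_i \mid c)}
     {\sum_{c' \in \{\mathrm{yes},\,\mathrm{no}\}} P(c')\,\prod_{i=1}^{4} P(x_i \mid c')}
```

where the priors and conditionals are estimated as simple frequency ratios from the labelled rows.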

I created a csv file with test data for processing the Naive Bayes algorithm in spark sql.

$ cat f:/data/DecisionTree3.csv
outlook,temperature,humidity,wind,playball
sunny,hot,high,weak,no
sunny,hot,high,strong,no
overcast,hot,high,weak,yes
rain,mild,high,weak,yes
rain,cool,normal,weak,yes
rain,cool,normal,strong,no
overcast,cool,normal,strong,yes
sunny,mild,high,weak,no
sunny,cool,normal,weak,yes
rain,mild,normal,weak,yes
sunny,mild,normal,strong,yes
overcast,mild,high,strong,yes
overcast,hot,normal,weak,yes
rain,mild,high,strong,no
sunny,hot,normal,weak,

"""

from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("NaiveBayes").master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR") #ERROR / WARN / INFO #TO SUPPRESS INFO MESSAGES

data = spark.read.load("f:/data/DecisionTree3.csv", format="csv", header=True, delimiter=",") #create DataFrame from csv file
print(data.columns)

data.show()

data.createOrReplaceTempView("decision_jp") #registerTempTable is deprecated/removed in newer Spark versions
spark.catalog.cacheTable("decision_jp")

#make sure the "playball" column in the csv file is BLANK in the last line. No double quotes, nor single quotes


spark.conf.set("spark.sql.crossJoin.enabled", "true") #To enable cartesian product in sql

QUERY="""
with q as (select outlook,temperature,humidity,wind from decision_jp where playball is null),
num_rows as (select count(*) val from decision_jp where playball is not null),
num_pos as (select count(*) val from decision_jp where playball='yes'),
num_c1_pos as (select count(*) val from decision_jp a,q where a.outlook = q.outlook and a.playball='yes'),
num_c2_pos as (select count(*) val from decision_jp a,q where a.temperature = q.temperature and a.playball='yes'),
num_c3_pos as (select count(*) val from decision_jp a,q where a.humidity = q.humidity and a.playball='yes'),
num_c4_pos as (select count(*) val from decision_jp a,q where a.wind = q.wind and a.playball='yes'),
num_yes as (select (num_c1_pos.val/num_pos.val)*(num_c2_pos.val/num_pos.val)*(num_c3_pos.val/num_pos.val)*(num_c4_pos.val/num_pos.val)*
(num_pos.val/num_rows.val) val from num_c1_pos,num_c2_pos,num_c3_pos,num_c4_pos,num_pos,num_rows),
num_nos as (select count(*) val from decision_jp where playball='no'),
num_c1_nos as (select count(*) val from decision_jp a,q where a.outlook = q.outlook and a.playball='no'),
num_c2_nos as (select count(*) val from decision_jp a,q where a.temperature = q.temperature and a.playball='no'),
num_c3_nos as (select count(*) val from decision_jp a,q where a.humidity = q.humidity and a.playball='no'),
num_c4_nos as (select count(*) val from decision_jp a,q where a.wind = q.wind and a.playball='no'),
num_neg as (select (num_c1_nos.val/num_nos.val)*(num_c2_nos.val/num_nos.val)*(num_c3_nos.val/num_nos.val)*(num_c4_nos.val/num_nos.val)*
(num_nos.val/num_rows.val) val from num_c1_nos,num_c2_nos,num_c3_nos,num_c4_nos,num_nos,num_rows)
select round(num_yes.val/(num_yes.val + num_neg.val),3) as yes_val , round(num_neg.val/(num_yes.val + num_neg.val),3) as no_val from num_yes,num_neg
"""
spark.sql(QUERY).show()

spark.stop()

"""
#References:
#https://www.orafaq.com/node/3116
#https://www.youtube.com/watch?v=jS1CKhALUBQ --Naive Bayes theorem explanation
#https://livesql.oracle.com --sql and pl/sql scripts were tested on this site
#http://www.orafaq.com/node/3222 --To install spark and setup spark on my desktop.
#http://www.scribd.com/doc/240194459/One-Question-Four-Solutions --Explore as many solutions as possible

--output from the spark sql script's execution (copied from my gitbash console):

['outlook', 'temperature', 'humidity', 'wind', 'playball']
+--------+-----------+--------+------+--------+
| outlook|temperature|humidity| wind|playball|
+--------+-----------+--------+------+--------+
| sunny| hot| high| weak| no|
| sunny| hot| high|strong| no|
|overcast| hot| high| weak| yes|
| rain| mild| high| weak| yes|
| rain| cool| normal| weak| yes|
| rain| cool| normal|strong| no|
|overcast| cool| normal|strong| yes|
| sunny| mild| high| weak| no|
| sunny| cool| normal| weak| yes|
| rain| mild| normal| weak| yes|
| sunny| mild| normal|strong| yes|
|overcast| mild| high|strong| yes|
|overcast| hot| normal| weak| yes|
| rain| mild| high|strong| no|
| sunny| hot| normal| weak| null|
+--------+-----------+--------+------+--------+

+-------+------+
|yes_val|no_val|
+-------+------+
| 0.673| 0.327|
+-------+------+

I wanted to test the Naive Bayes algorithm in sql and pl/sql as well, to crosscheck the results against the spark sql output.
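Before the database versions, the same arithmetic can be cross-checked in a few lines of plain Python. This is an illustrative sketch, not part of the original scripts; it hard-codes the 14 labelled rows and the query row from DecisionTree3.csv:

```python
# Plain-Python cross-check of the Naive Bayes arithmetic (illustrative sketch).
# The 14 labelled rows from DecisionTree3.csv:
rows = [
    ("sunny","hot","high","weak","no"),     ("sunny","hot","high","strong","no"),
    ("overcast","hot","high","weak","yes"), ("rain","mild","high","weak","yes"),
    ("rain","cool","normal","weak","yes"),  ("rain","cool","normal","strong","no"),
    ("overcast","cool","normal","strong","yes"), ("sunny","mild","high","weak","no"),
    ("sunny","cool","normal","weak","yes"), ("rain","mild","normal","weak","yes"),
    ("sunny","mild","normal","strong","yes"), ("overcast","mild","high","strong","yes"),
    ("overcast","hot","normal","weak","yes"), ("rain","mild","high","strong","no"),
]
query = ("sunny", "hot", "normal", "weak")   # the row whose playball is NULL

def posterior(rows, query):
    total = len(rows)
    score = {}
    for label in ("yes", "no"):
        labelled = [r for r in rows if r[4] == label]
        p = len(labelled) / total             # prior P(label)
        for i, v in enumerate(query):         # times each P(feature_i = v | label)
            p *= sum(1 for r in labelled if r[i] == v) / len(labelled)
        score[label] = p
    s = sum(score.values())                   # normalise over both classes
    return {k: round(v / s, 3) for k, v in score.items()}

print(posterior(rows, query))   # {'yes': 0.673, 'no': 0.327}
```

The result matches the spark sql output above (.673 / .327), which is the point of the crosscheck.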

Created a table and inserted the rows from the csv file. Since the data contained only a few rows, I chose to insert them directly.
You may instead create an external table and load the data from the csv file.

I used the oracle database on this site for my testing:
#https://livesql.oracle.com/

connect Saketh/Veeksha

drop table decision_jp;
create table decision_jp(outlook varchar2(10),temperature varchar2(10),humidity varchar2(10),wind varchar2(10),playball varchar2(10));

I used the shell script below to convert rows from my csv file into insert statements.
You can choose any other option to convert the rows into insert statements.

$ (cat f:/data/DecisionTree3.csv|
grep -v "wind" | awk '{print "insert into decision_jp values(#"$0"#);"}'|sed "s/#/'/g;s/,/','/g"
)
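As one such alternative (a hypothetical sketch, not the author's script), the same conversion can be done in plain Python, skipping the header row and quoting each value:

```python
# Illustrative alternative to the shell one-liner: turn csv rows into
# INSERT statements. The table name and input lines are examples.
def csv_to_inserts(lines, table="decision_jp"):
    stmts = []
    for line in lines[1:]:                        # skip the header row
        vals = ",".join("'%s'" % v for v in line.strip().split(","))
        stmts.append("insert into %s values(%s);" % (table, vals))
    return stmts

lines = ["outlook,temperature,humidity,wind,playball",
         "sunny,hot,high,weak,no"]
print(csv_to_inserts(lines)[0])
# insert into decision_jp values('sunny','hot','high','weak','no');
```

Note that the trailing blank playball in the last csv row becomes an empty string literal, matching the last insert statement below.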

Executed the following insert statements to populate the table decision_jp.

insert into decision_jp values('sunny','hot','high','weak','no');
insert into decision_jp values('sunny','hot','high','strong','no');
insert into decision_jp values('overcast','hot','high','weak','yes');
insert into decision_jp values('rain','mild','high','weak','yes');
insert into decision_jp values('rain','cool','normal','weak','yes');
insert into decision_jp values('rain','cool','normal','strong','no');
insert into decision_jp values('overcast','cool','normal','strong','yes');
insert into decision_jp values('sunny','mild','high','weak','no');
insert into decision_jp values('sunny','cool','normal','weak','yes');
insert into decision_jp values('rain','mild','normal','weak','yes');
insert into decision_jp values('sunny','mild','normal','strong','yes');
insert into decision_jp values('overcast','mild','high','strong','yes');
insert into decision_jp values('overcast','hot','normal','weak','yes');
insert into decision_jp values('rain','mild','high','strong','no');
insert into decision_jp values('sunny','hot','normal','weak','');

SQL CODE:
---------
with q as (select outlook,temperature,humidity,wind from decision_jp where playball is null),
num_rows as (select count(*) val from decision_jp where playball is not null),
num_pos as (select count(*) val from decision_jp where playball='yes'),
num_c1_pos as (select count(*) val from decision_jp a,q where a.outlook = q.outlook and a.playball='yes'),
num_c2_pos as (select count(*) val from decision_jp a,q where a.temperature = q.temperature and a.playball='yes'),
num_c3_pos as (select count(*) val from decision_jp a,q where a.humidity = q.humidity and a.playball='yes'),
num_c4_pos as (select count(*) val from decision_jp a,q where a.wind = q.wind and a.playball='yes'),
num_yes as (select (num_c1_pos.val/num_pos.val)*(num_c2_pos.val/num_pos.val)*(num_c3_pos.val/num_pos.val)*(num_c4_pos.val/num_pos.val)*
(num_pos.val/num_rows.val) val from num_c1_pos,num_c2_pos,num_c3_pos,num_c4_pos,num_pos,num_rows),
num_nos as (select count(*) val from decision_jp where playball='no'),
num_c1_nos as (select count(*) val from decision_jp a,q where a.outlook = q.outlook and a.playball='no'),
num_c2_nos as (select count(*) val from decision_jp a,q where a.temperature = q.temperature and a.playball='no'),
num_c3_nos as (select count(*) val from decision_jp a,q where a.humidity = q.humidity and a.playball='no'),
num_c4_nos as (select count(*) val from decision_jp a,q where a.wind = q.wind and a.playball='no'),
num_neg as (select (num_c1_nos.val/num_nos.val)*(num_c2_nos.val/num_nos.val)*(num_c3_nos.val/num_nos.val)*(num_c4_nos.val/num_nos.val)*
(num_nos.val/num_rows.val) val from num_c1_nos,num_c2_nos,num_c3_nos,num_c4_nos,num_nos,num_rows)
select round(num_yes.val/(num_yes.val + num_neg.val),3) as yes_val , round(num_neg.val/(num_yes.val + num_neg.val),3) as no_val from num_yes,num_neg

YES_VAL NO_VAL
.673 .327

Executed the following pl/sql script to process the table data.

PL/SQL CODE:
-----------
declare
v_pst varchar2(20):='';
v_pnm number:=0;
v_val number;
v_num number;
begin
for c0 in (select count(*) val from decision_jp where playball is not null) loop
for c1 in (select playball,count(*) val from decision_jp where playball is not null
group by playball order by 1) loop
v_val:=1;
v_num:=0;
for c2 in (select column_name from user_tab_columns where table_name='DECISION_JP'
and column_name != 'PLAYBALL' order by column_id) loop
execute immediate 'select count(*) from decision_jp a, decision_jp b
where a.playball='''||c1.playball||''' and b.playball is null and a.'||c2.column_name||'=b.'||c2.column_name||' '
into v_num;
--dbms_output.put_line(c0.val||' '||c1.playball||' '||c1.val||' '||c2.column_name ||' '||v_num);
v_val:=(v_val)*(v_num/c1.val);
end loop;
v_val:=v_val*(c1.val/c0.val);
if (v_pnm = 0) then
v_pnm:=v_val;
v_pst:=c1.playball;
else

dbms_output.put_line(v_pst||' '||round(v_pnm/(v_val + v_pnm),3));

dbms_output.put_line(c1.playball||' '||round(v_val/(v_val + v_pnm),3));
end if;
end loop;
end loop;
end;
/
Statement processed.
no .327
yes .673
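One caveat worth noting for both the sql and pl/sql versions (not addressed in the scripts above): these conditional probabilities are raw frequency ratios, so if a feature value never co-occurs with a class, the whole product collapses to zero. A common remedy is Laplace (add-one) smoothing, sketched here with hypothetical counts from this dataset:

```python
def smoothed(count, class_total, n_values):
    # Laplace (add-one) smoothing: avoids a zero conditional probability
    # when a feature value never co-occurs with a class in the training data.
    return (count + 1) / (class_total + n_values)

# e.g. outlook='overcast' never occurs with playball='no' (0 of 5 'no' rows);
# unsmoothed P = 0/5 = 0, while with 3 distinct outlook values:
print(smoothed(0, 5, 3))   # 0.125
```

With the query row used here no conditional count is zero, so the unsmoothed ratios suffice, but a query containing ('overcast', ..., 'no') would silently get probability 0.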

Processed the data in python using the "MultinomialNB" classifier from scikit-learn.

/c/anaconda3/python /c/pythonscripts/naivebayes3.py

outlook temperature humidity wind playball
0 sunny hot high weak no
1 sunny hot high strong no
2 overcast hot high weak yes
3 rain mild high weak yes
4 rain cool normal weak yes
5 rain cool normal strong no
6 overcast cool normal strong yes
7 sunny mild high weak no
8 sunny cool normal weak yes
9 rain mild normal weak yes
10 sunny mild normal strong yes
11 overcast mild high strong yes
12 overcast hot normal weak yes
13 rain mild high strong no
14 sunny hot normal weak NaN <--FINDING THE OUTCOME HERE

{'no': 1, 'yes': 2, 'nan': 0} <--ASSIGNED VALUES FOR OUTCOME: 'playball'
outlook temperature humidity wind
14 3 2 2 2 [2] <--CATEGORICAL VALUES WERE CONVERTED TO INTEGERS TO USE PYTHON LIBRARY
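The integer mapping shown above is what a label encoder produces: each distinct category is assigned its index in the sorted unique values (so 'nan' < 'no' < 'yes' gives 0, 1, 2). A minimal stdlib sketch of that encoding step follows; the original naivebayes3.py presumably used pandas/scikit-learn for this, so treat the helper below as an assumption:

```python
def label_encode(values):
    # Map each distinct category to its index in the sorted unique values,
    # mirroring scikit-learn's LabelEncoder behaviour.
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return mapping, [mapping[v] for v in values]

mapping, codes = label_encode(["no", "no", "yes", "nan"])
print(mapping)   # {'nan': 0, 'no': 1, 'yes': 2}
print(codes)     # [1, 1, 2, 0]
```

Applying the same encoding to each feature column yields the all-integer rows that MultinomialNB can consume, e.g. row 14 above becoming 3 2 2 2.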

Happy Scripting.
"""