Apriori Algorithm in SQL, PL/SQL and Spark SQL


"""
Apriori Algorithm in SQL, PL/SQL and Spark SQL
Written by JP Vijaykumar
Date Sep 14 2021

********************************************************************************
This script is provided for educational purposes only.

The SQL scripts were tested in 11g & 19c databases.
While testing the SQL scripts, please run them in a dev/test db only and exercise precaution.

Readers' discretion/ingenuity is advised in making the changes/modifications to these
scripts that may be required for use in their respective environments.
The user assumes full responsibility while executing these scripts.
********************************************************************************
"""
#spark-submit.cmd python/pysparkapriori.py
import time

start_time = time.time()
"""
Previously I presented a document with the title "Apriori Algorithm in PL/SQL", wherein I used
for loops and if/then/else logic to process data in PL/SQL and fetch frequent patterns from a
transactional dataset.

Please note, I am not implementing the support/confidence calculation in these scripts.
The scripts just pull the frequent item sets from the transactional dataset.

I was curious as to how to pull frequent pattern data from transactional data using SQL commands
alone, without any procedural logic.

Fortunately, I was able to accomplish this with table joins in plain SQL.
The same logic was also applied to an Oracle table's data to pull frequent pattern data.
I present those SQL scripts at the end of this article.

This table join logic works fine for finding frequent patterns on smaller datasets.
It takes more time and is not efficient for processing large amounts of data.
Using Python's supplied libraries, large amounts of data can be processed in less time, efficiently.

I am sure that, in Oracle, the table join logic can be fine tuned using hints and the tables' join order.

In Spark SQL, the listagg function is not available; instead, a similar option, collect_list, was used.
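For readers unfamiliar with these aggregates, both gather a group's values into one unit per key. A minimal plain-Python emulation of the listagg-style comma separated line (toy rows assumed, not the article's full dataset):

```python
from collections import defaultdict

# (tran_id, item) rows, as in the csv below (small subset for illustration)
rows = [(1, "curd"), (1, "apple"), (2, "spoon"), (2, "jam")]

baskets = defaultdict(list)          # tran_id -> list of items (collect_list-style)
for tran_id, item in rows:
    baskets[tran_id].append(item)

# listagg-style: one comma separated, ordered line per transaction
lines = {t: ",".join(sorted(items)) for t, items in baskets.items()}
print(lines[1])   # apple,curd
```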

Created the file apriori_spark.csv, with the following data and saved it in my local folder "f:/data".

tran_id,item
1,curd
1,apple
1,fruit
1,lamp
1,pot
1,rose
1,jam
2,spoon
2,detergent
2,glue
2,hat
2,tape
2,umbrella
2,jam
3,jam
4,curd
4,apple
4,fruit
4,pot
4,rose
4,tape
5,fruit
5,banana
5,spoon
5,hat
5,ink
5,lamp
5,umbrella
6,curd
6,detergent
6,glue
6,hat
6,ink
6,jam
7,fruit
7,glue
7,tape
8,spoon
8,ink
8,rose
8,umbrella
9,curd
9,banana
9,ink
10,apple
10,spoon
10,ink
10,umbrella
10,jam

"""
#spark sql code starts here################################################################
from pyspark import SparkContext,SparkConf
conf = SparkConf().setAppName("Apriori").setMaster("local[*]")
sc = SparkContext(conf=conf)

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("apriori").master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR") #ERROR / WARN / INFO #TO SUPPRESS INFO MESSAGES
from pyspark.sql.functions import *

df = spark.read.format("csv").option("inferSchema","false").option("header","true").option("sep",",").load("f:/data/apriori_spark.csv")
print(df.columns)
print(df.describe())

print(df.count(),len(df.columns)) #equivalent to python df.shape
df.show()
tranCount = df.count()

print(tranCount)

df.createOrReplaceTempView("data")

spark.sql("select tran_id,count(*) num_products from data group by tran_id having count(*)>1 order by tran_id").show()
###apriori
#concat row values into comma separated line
df = spark.sql("""with t as (select tran_id,count(*) from data
group by tran_id having count(*) > 1 order by tran_id)
select t.tran_id,lower(item) item from data,t
where data.tran_id = t.tran_id order by 1,2""")
df.show()

#similar to listagg function in Oracle
df.groupby('tran_id').agg(collect_list('item').alias('item')).createOrReplaceTempView("item_agg")
spark.sql("select item,count(*) from item_agg group by item order by 2 desc").show(60,False) #with this option, "show" will not chop column length in display

#convert array type to str type using concat_ws in spark sql and search for frequent patterns
spark.sql("""with a as (select distinct item item from data order by 1),
t2 as (select t1.item item1,t2.item item2 from a t1, a t2 where t1.item < t2.item order by 1,2)
select t2.item1,t2.item2,count(*) from t2,item_agg
where concat_ws(' ',item_agg.item) like '%'||t2.item1||'%'
and concat_ws(' ',item_agg.item) like '%'||t2.item2||'%'
group by t2.item1,t2.item2 having count(*)>1 order by 3 desc,1,2""").show(80,False)
print(time.time() - start_time) #check time taken for the script to run
#spark sql code ends here################################################################
"""
When I run the above script with the following command
spark-submit.cmd python/pysparkapriori.py
I get the output as shown below:

['tran_id', 'item']
DataFrame[summary: string, tran_id: string, item: string]
49 2
+-------+---------+
|tran_id| item|
+-------+---------+
| 1| curd|
| 1| apple|
| 1| fruit|
| 1| lamp|
| 1| pot|
| 1| rose|
| 1| jam|
| 2| spoon|
| 2|detergent|
| 2| glue|
| 2| hat|
| 2| tape|
| 2| umbrella|
| 2| jam|
| 3| jam|
| 4| curd|
| 4| apple|
| 4| fruit|
| 4| pot|
| 4| rose|
+-------+---------+
only showing top 20 rows

49
+-------+------------+
|tran_id|num_products|
+-------+------------+
| 1| 7|
| 10| 5|
| 2| 7|
| 4| 6|
| 5| 7|
| 6| 6|
| 7| 3|
| 8| 4|
| 9| 3|
+-------+------------+

+-------+---------+
|tran_id| item|
+-------+---------+
| 1| apple|
| 1| curd|
| 1| fruit|
| 1| jam|
| 1| lamp|
| 1| pot|
| 1| rose|
| 10| apple|
| 10| ink|
| 10| jam|
| 10| spoon|
| 10| umbrella|
| 2|detergent|
| 2| glue|
| 2| hat|
| 2| jam|
| 2| spoon|
| 2| tape|
| 2| umbrella|
| 4| apple|
+-------+---------+
only showing top 20 rows

+--------------------------------------------------+--------+
|item |count(1)|
+--------------------------------------------------+--------+
|[apple, ink, jam, spoon, umbrella] |1 |
|[banana, fruit, hat, ink, lamp, spoon, umbrella] |1 |
|[apple, curd, fruit, pot, rose, tape] |1 |
|[banana, curd, ink] |1 |
|[apple, curd, fruit, jam, lamp, pot, rose] |1 |
|[detergent, glue, hat, jam, spoon, tape, umbrella]|1 |
|[ink, rose, spoon, umbrella] |1 |
|[fruit, glue, tape] |1 |
|[curd, detergent, glue, hat, ink, jam] |1 |
+--------------------------------------------------+--------+

+---------+--------+--------+
|item1 |item2 |count(1)|
+---------+--------+--------+
|spoon |umbrella|4 |
|ink |spoon |3 |
|ink |umbrella|3 |
|apple |curd |2 |
|apple |fruit |2 |
|apple |jam |2 |
|apple |pot |2 |
|apple |rose |2 |
|banana |ink |2 |
|curd |fruit |2 |
|curd |ink |2 |
|curd |jam |2 |
|curd |pot |2 |
|curd |rose |2 |
|detergent|glue |2 |
|detergent|hat |2 |
|detergent|jam |2 |
|fruit |lamp |2 |
|fruit |pot |2 |
|fruit |rose |2 |
|fruit |tape |2 |
|glue |hat |2 |
|glue |jam |2 |
|glue |tape |2 |
|hat |ink |2 |
|hat |jam |2 |
|hat |spoon |2 |
|hat |umbrella|2 |
|ink |jam |2 |
|jam |spoon |2 |
|jam |umbrella|2 |
|pot |rose |2 |
+---------+--------+--------+

Here I implement the Apriori algorithm in SQL using table joins.

connect veeksha/saketh

drop table apriori;
create table apriori(tran_id number,item varchar2(30));
truncate table apriori;

The csv file can be read into an Oracle external table to process the data.
Instead, I preferred to insert the data into a table using insert commands.
You may also load the csv file via the SQL*Loader option.

insert into apriori values(1,'curd');
insert into apriori values(1,'apple');
insert into apriori values(1,'fruit');
insert into apriori values(1,'lamp');
insert into apriori values(1,'pot');
insert into apriori values(1,'rose');
insert into apriori values(1,'jam');
insert into apriori values(2,'spoon');
insert into apriori values(2,'detergent');
insert into apriori values(2,'glue');
insert into apriori values(2,'hat');
insert into apriori values(2,'tape');
insert into apriori values(2,'umbrella');
insert into apriori values(2,'jam');
insert into apriori values(3,'jam');
insert into apriori values(4,'curd');
insert into apriori values(4,'apple');
insert into apriori values(4,'fruit');
insert into apriori values(4,'pot');
insert into apriori values(4,'rose');
insert into apriori values(4,'tape');
insert into apriori values(5,'fruit');
insert into apriori values(5,'banana');
insert into apriori values(5,'spoon');
insert into apriori values(5,'hat');
insert into apriori values(5,'ink');
insert into apriori values(5,'lamp');
insert into apriori values(5,'umbrella');
insert into apriori values(6,'curd');
insert into apriori values(6,'detergent');
insert into apriori values(6,'glue');
insert into apriori values(6,'hat');
insert into apriori values(6,'ink');
insert into apriori values(6,'jam');
insert into apriori values(7,'fruit');
insert into apriori values(7,'glue');
insert into apriori values(7,'tape');
insert into apriori values(8,'spoon');
insert into apriori values(8,'ink');
insert into apriori values(8,'rose');
insert into apriori values(8,'umbrella');
insert into apriori values(9,'curd');
insert into apriori values(9,'banana');
insert into apriori values(9,'ink');
insert into apriori values(10,'apple');
insert into apriori values(10,'spoon');
insert into apriori values(10,'ink');
insert into apriori values(10,'umbrella');
insert into apriori values(10,'jam');

commit;

select * from apriori;
TRAN_ID ITEM
---------- ------------------------------
1 curd
1 apple
1 fruit
1 lamp
1 pot
1 rose
1 jam
2 spoon
2 detergent
2 glue
2 hat
2 tape
2 umbrella
2 jam
3 jam
4 curd
4 apple
4 fruit
4 pot
4 rose
4 tape
5 fruit
5 banana
5 spoon
5 hat
5 ink
5 lamp
5 umbrella
6 curd
6 detergent
6 glue
6 hat
6 ink
6 jam
7 fruit
7 glue
7 tape
8 spoon
8 ink
8 rose
8 umbrella
9 curd
9 banana
9 ink
10 apple
10 spoon
10 ink
10 umbrella
10 jam

49 rows selected.

Generating the Apriori algorithm output in SQL using the following query:
Here I am using the Oracle "listagg" function to pull all items pertaining to a tran_id into one comma separated line.
A similar function is utilized in Spark SQL while processing the data with the Apriori algorithm.

with t1 as (select tran_id,item from apriori order by 1,2),
t2 as (select tran_id,listagg(item,',') within group(order by item ) as item from t1 group by tran_id order by tran_id),
t3 as (select distinct item from apriori order by item),
t4 as (select a1.item item1,a2.item item2 from t3 a1,t3 a2 where a1.item < a2.item)
select item1, item2,count(*) from t4,t2
where t2.item like '%'||t4.item1||'%'||t4.item2||'%'
group by t4.item1,t4.item2 having count(*)>1 order by 3 desc,1,2;

ITEM1 ITEM2 COUNT(*)
spoon umbrella 4
ink spoon 3
ink umbrella 3
apple curd 2
apple fruit 2
apple jam 2
apple pot 2
apple rose 2
banana ink 2
curd fruit 2
curd ink 2
curd jam 2
curd pot 2
curd rose 2
detergent glue 2
detergent hat 2
detergent jam 2
fruit lamp 2
fruit pot 2
fruit rose 2
fruit tape 2
glue hat 2
glue jam 2
glue tape 2
hat ink 2
hat jam 2
hat spoon 2
hat umbrella 2
ink jam 2
jam spoon 2
jam umbrella 2
pot rose 2

32 rows selected.
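One caveat with the '%'||item1||'%'||item2||'%' predicate: it is plain substring matching on the aggregated line, so an item whose name is contained inside another item's name would be over-counted. The item names in this dataset happen to avoid that, but a delimiter-aware test is safer. A small Python illustration (hypothetical "teapot" item assumed, not in the dataset):

```python
basket_line = "curd,teapot,rose"   # hypothetical aggregated transaction

# naive LIKE-style substring test: false positive, "pot" was never bought
print("pot" in basket_line)              # True

# delimiter-aware test: split on the separator first, then compare whole items
print("pot" in basket_line.split(","))   # False
```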

I verified the final results from the Spark SQL script output against the SQL query output, and they match.

Here I used the WITH clause FUNCTION feature (introduced in Oracle 12c, tested here in 19c) to create a function inside my select query and pulled the data:

WITH
FUNCTION get_str(id IN NUMBER) RETURN varchar2 IS
v_str varchar2(200):='';
BEGIN
for c1 in (select item from apriori where tran_id = id order by 1) loop
v_str:=v_str||' '||c1.item;
end loop;
RETURN v_str;
END;
t1 as (SELECT get_str(tran_id) item
FROM (select distinct tran_id from apriori order by tran_id)),
t2 as (select distinct item item from apriori order by item),
t3 as (select a1.item item1,a2.item item2 from t2 a1,t2 a2 where a1.item < a2.item)
select t3.item1, t3.item2,count(*) from t3,t1
where t1.item like '%'||t3.item1||'%'||t3.item2||'%'
group by t3.item1,t3.item2 having count(*)>1 order by 3 desc,1,2;

ITEM1 ITEM2 COUNT(*)
spoon umbrella 4
ink spoon 3
ink umbrella 3
apple curd 2
apple fruit 2
apple jam 2
apple pot 2
apple rose 2
banana ink 2
curd fruit 2
curd ink 2
curd jam 2
curd pot 2
curd rose 2
detergent glue 2
detergent hat 2
detergent jam 2
fruit lamp 2
fruit pot 2
fruit rose 2
fruit tape 2
glue hat 2
glue jam 2
glue tape 2
hat ink 2
hat jam 2
hat spoon 2
hat umbrella 2
ink jam 2
jam spoon 2
jam umbrella 2
pot rose 2

32 rows selected.

I don't like using complex aggregate functions or the WITH clause FUNCTION feature to process the transaction data
and find frequent patterns with the Apriori algorithm.

Can I just use the transactional table's data as it is and produce the frequent patterns?
The following is the simplest solution to produce frequent patterns using the Apriori algorithm.

with t1 as (select tran_id,item from apriori order by 1,2)
select a1.item,a2.item, count(*) from t1 a1,t1 a2
where a1.tran_id = a2.tran_id and a1.item < a2.item
group by a1.item,a2.item having count(*)> 1 order by 3 desc,1,2;

ITEM ITEM COUNT(*)
spoon umbrella 4
ink spoon 3
ink umbrella 3
apple curd 2
apple fruit 2
apple jam 2
apple pot 2
apple rose 2
banana ink 2
curd fruit 2
curd ink 2
curd jam 2
curd pot 2
curd rose 2
detergent glue 2
detergent hat 2
detergent jam 2
fruit lamp 2
fruit pot 2
fruit rose 2
fruit tape 2
glue hat 2
glue jam 2
glue tape 2
hat ink 2
hat jam 2
hat spoon 2
hat umbrella 2
ink jam 2
jam spoon 2
jam umbrella 2
pot rose 2

32 rows selected.

Please note, the three SQL queries above produce identical results.
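The self-join logic can also be sketched in plain Python: group items by tran_id, then count each ordered pair (the a1.item < a2.item condition) across transactions (toy rows assumed, not the full dataset):

```python
from collections import Counter, defaultdict
from itertools import combinations

rows = [(1, "curd"), (1, "apple"), (1, "jam"),
        (2, "spoon"), (2, "jam"),
        (3, "jam"), (3, "apple")]

baskets = defaultdict(set)               # tran_id -> set of items
for tran_id, item in rows:
    baskets[tran_id].add(item)

# combinations over sorted items yields exactly the a1.item < a2.item pairs
pair_counts = Counter()
for items in baskets.values():
    pair_counts.update(combinations(sorted(items), 2))

frequent = {p: c for p, c in pair_counts.items() if c > 1}   # having count(*) > 1
print(frequent)   # {('apple', 'jam'): 2}
```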

To find frequent patterns of three items bought together in my transactional table, I use the following modified query.

with t1 as (select tran_id,item from apriori order by 1,2)
select a1.item,a2.item, a3.item,count(*) from t1 a1,t1 a2,t1 a3
where a1.tran_id = a2.tran_id and a2.tran_id = a3.tran_id and a1.item < a2.item and a2.item < a3.item
group by a1.item,a2.item,a3.item having count(*)> 1 order by 4 desc,1,2,3;

ITEM ITEM ITEM COUNT(*)
ink spoon umbrella 3
apple curd fruit 2
apple curd pot 2
apple curd rose 2
apple fruit pot 2
apple fruit rose 2
apple pot rose 2
curd fruit pot 2
curd fruit rose 2
curd pot rose 2
detergent glue hat 2
detergent glue jam 2
detergent hat jam 2
fruit pot rose 2
glue hat jam 2
hat spoon umbrella 2
jam spoon umbrella 2

17 rows selected.

SQL is very powerful; complex joins can express quite complicated logic.

To find frequent patterns of four items bought together in my transactional table, I use the following modified query.

with t1 as (select tran_id,item from apriori order by 1,2)
select a1.item,a2.item, a3.item,a4.item,count(*) from t1 a1,t1 a2,t1 a3,t1 a4
where a1.tran_id = a2.tran_id and a2.tran_id = a3.tran_id and a3.tran_id = a4.tran_id and a1.item < a2.item
and a2.item < a3.item and a3.item < a4.item
group by a1.item,a2.item,a3.item,a4.item having count(*)> 1 order by 5 desc,1,2,3,4;

ITEM ITEM ITEM ITEM COUNT(*)
apple curd fruit pot 2
apple curd fruit rose 2
apple curd pot rose 2
apple fruit pot rose 2
curd fruit pot rose 2
detergent glue hat jam 2

6 rows selected.

This simplified table join is used to find frequent patterns of four items sold together.
It can be wrapped in a package to find frequent patterns of n items sold together in a dataset.

I implemented the complete logic in the following PL/SQL script.
Its output is not sorted, as it was in the SQL queries' output shown above.
Though the SQL queries are complex, the amount of coding is less compared to PL/SQL scripting.
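The n-way generalization can be sketched in Python without writing a new query per size: itertools.combinations with a size parameter plays the role of the chained a1..an aliases (hypothetical helper, not part of the article's scripts):

```python
from collections import Counter, defaultdict
from itertools import combinations

def frequent_itemsets(rows, n, min_count=2):
    """Count n-item combinations per transaction; keep those seen min_count+ times."""
    baskets = defaultdict(set)
    for tran_id, item in rows:
        baskets[tran_id].add(item)
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), n))   # item1 < item2 < ... < itemN
    return {s: c for s, c in counts.items() if c >= min_count}

rows = [(1, "apple"), (1, "curd"), (1, "pot"),
        (4, "apple"), (4, "curd"), (4, "pot")]
print(frequent_itemsets(rows, 3))   # {('apple', 'curd', 'pot'): 2}
```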

set serverout on size 1000000 linesize 200
declare
    v_num number:=0;
    v_str varchar2(200):='';
    v_stt varchar2(100):='';
    v_nut number:=0;
begin
    for c0 in (with t as (select distinct item from apriori order by 1)
               select t1.item item1,t2.item item2 from t t1, t t2
               where t1.item < t2.item) loop
        v_num:=0;
        for c1 in (select tran_id,item from apriori order by 1,2) loop
            if (v_str is null and v_nut = 0) then
                --first row: start a new transaction buffer
                v_str:=c1.item;
                v_stt:=c1.item;
                v_nut:=c1.tran_id;
            elsif (v_str is not null and v_nut = c1.tran_id and v_stt <> c1.item) then
                --same transaction: append the item to the buffer
                v_str:=v_str||','||c1.item;
                v_stt:=c1.item;
            elsif (v_nut <> c1.tran_id and v_nut <> 0) then
                --transaction changed: test the completed buffer, then restart it
                if (v_str like '%'||c0.item1||'%'||c0.item2||'%') then
                    v_num:=v_num+1;
                end if;
                v_nut:=c1.tran_id;
                v_stt:=c1.item;
                v_str:=c1.item;
            end if;
        end loop;
        --the last transaction is still held in v_str when the cursor ends; test it here
        if (v_str like '%'||c0.item1||'%'||c0.item2||'%') then
            v_num:=v_num+1;
        end if;
        --reset the running buffers before processing the next item pair
        v_str:=null;
        v_stt:=null;
        v_nut:=0;
        if (v_num > 1) then
            dbms_output.put_line(v_num||' -- '||c0.item1||' '||c0.item2);
        end if;
    end loop;
end;
/

Statement processed.
2 -- apple curd
2 -- apple fruit
2 -- apple jam
2 -- apple pot
2 -- apple rose
2 -- banana ink
2 -- curd fruit
2 -- curd ink
2 -- curd jam
2 -- curd pot
2 -- curd rose
2 -- detergent glue
2 -- detergent hat
2 -- detergent jam
2 -- fruit lamp
2 -- fruit pot
2 -- fruit rose
2 -- fruit tape
2 -- glue hat
2 -- glue jam
2 -- glue tape
2 -- hat ink
2 -- hat jam
2 -- hat spoon
2 -- hat umbrella
2 -- ink jam
3 -- ink spoon
3 -- ink umbrella
2 -- jam spoon
2 -- jam umbrella
2 -- pot rose
4 -- spoon umbrella
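The cursor logic above groups rows as tran_id changes, rescanning the whole table once per candidate pair. The same grouping can be sketched as a single pass in Python that counts all pairs in one sweep (toy rows assumed, ordered by tran_id as the cursor's "order by 1,2" guarantees):

```python
from collections import Counter
from itertools import combinations

# rows must arrive ordered by tran_id, like "order by 1,2" in the cursor
rows = [(1, "apple"), (1, "curd"), (2, "curd"), (2, "apple"), (3, "jam")]

pair_counts = Counter()
current_id, current_items = None, []
for tran_id, item in rows + [(None, None)]:      # sentinel row flushes the last group
    if tran_id != current_id and current_items:
        # transaction changed: count every pair in the completed basket
        pair_counts.update(combinations(sorted(set(current_items)), 2))
        current_items = []
    current_id = tran_id
    if item is not None:
        current_items.append(item)

print(pair_counts[("apple", "curd")])   # 2
```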

Please note, if we are mining frequent patterns of three or more items together, some of the above scripts' logic changes.
Most of these scripts hold good for frequent pattern search of two items in a given dataset.

Please note, I chose a transactional table with item details in multiple rows, one item per row.
If we need to process data with multiple items in a single row, then the scripts need to be modified accordingly.

Next steps:
I tested these scripts with a smaller dataset.
Benchmark all the available options on a larger dataset and choose the best method.
Fine tune the SQL queries to perform better, using hints, modifying the tables' join order, etc.
Implement code changes to generate the support/confidence numbers.
I implemented the SQL query with the listagg option in Spark SQL. Try other query options in Spark SQL and generate the output.
PL/SQL is not supported in Spark SQL. Convert the PL/SQL logic into HPL/SQL and use it with Spark to generate the output.
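The support/confidence step listed above can be sketched from the same counts: support(X) is the fraction of transactions containing X, and confidence(X -> Y) = count(X and Y) / count(X). A plain-Python sketch (toy rows assumed, not from the article's scripts):

```python
from collections import Counter, defaultdict
from itertools import combinations

rows = [(1, "spoon"), (1, "umbrella"), (2, "spoon"), (2, "umbrella"),
        (3, "spoon"), (4, "umbrella"), (5, "spoon"), (5, "umbrella")]

baskets = defaultdict(set)
for tran_id, item in rows:
    baskets[tran_id].add(item)

n_trans = len(baskets)
item_counts = Counter()
pair_counts = Counter()
for items in baskets.values():
    item_counts.update(items)
    pair_counts.update(combinations(sorted(items), 2))

pair = ("spoon", "umbrella")
support = pair_counts[pair] / n_trans                  # fraction of all transactions
confidence = pair_counts[pair] / item_counts["spoon"]  # P(umbrella | spoon)
print(support, confidence)   # 0.6 0.75
```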

Happy scripting.

Reference:
http://www.orafaq.com/node/3138
http://www.orafaq.com/node/3222 (to install and set up Spark on my desktop)
https://livesql.oracle.com/ (I tested my Apriori algorithm SQL scripts here)
https://dzone.com/articles/apriori-itemset-generation-with-oracle-sql
https://programming.vip/docs/data-mining-apriori-association-analysis-to-reproduce-the-myth-of-beer-diapers.html

"""