Recently I was looking at how I could implement an ML model to recommend content based on the answers to some questions. As usual, it wasn't completely plain sailing, but I learnt quite a bit in the process.
Before I get to the issues, I'll explain a bit about the process so hang tight!
Consistent input data
To produce a meaningful output, it was important to ensure the input data was consistent. To achieve this, we decided the user would answer the questions as part of a questionnaire, with a Likert scale provided for each one.
The answers were something like the following:
- Strongly disagree
- Disagree
- Neither agree nor disagree
- Agree
- Strongly agree
The point is that by taking the sentiment of these values, it's easy to convert each answer to a number between 0 and 1.
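For illustration, a minimal sketch of that conversion might look like the following. The exact values below are my own assumption of an evenly spaced five-point scale, not necessarily the mapping we actually used:

```python
# Illustrative only: map each Likert answer to a value between 0 and 1,
# assuming an evenly spaced five-point scale.
LIKERT_SCORES = {
    'Strongly disagree': 0.0,
    'Disagree': 0.25,
    'Neither agree nor disagree': 0.5,
    'Agree': 0.75,
    'Strongly agree': 1.0,
}

def score_answer(answer: str) -> float:
    """Convert a Likert answer into its numeric sentiment score."""
    return LIKERT_SCORES[answer]
```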
Arriving at the output
Once the input was sorted, it was time to think about the output.
We decided that a decision tree could be a good way of getting to each recommendation, so we created a CSV with 15 columns (one for each of the 11 questions and one for each of the 4 outputs). The next step involved creating some training data, which in simple terms means filling the CSV with rows of data to train the model on which outputs to present for the given inputs.
Here is a simple representation of what the CSV looked like (there were about 100 rows in total). I remember reading somewhere that it's important not to over-train the model, so this seemed like a good amount to get started with, at least.
k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, A, B, C, D
0.2, 0.4, 0.4, 0.6, 0.4, 0.5, 0.5, 0.667, 0.333, 0.333, 0.667, 0.333, 0, 0.667, 0.667
0, 0.2, 0.2, 0, 0.2, 0.25, 0.25, 0.667, 0.333, 1, 0.667, 0, 1, 0, 0.333
It's also worth noting that whilst there are 11 inputs and 4 outputs, not every input was used in determining each output. In fact, a maximum of 5 inputs was used for any one output.
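To make that concrete, here is the mapping between each output and the inputs that feed it (these are the same subsets you'll see in the code further down):

```python
# Which question columns feed each recommendation output
FEATURES_PER_OUTPUT = {
    'A': ['k3', 'k5'],
    'B': ['k1', 'k3', 'k4', 'k10', 'k11'],
    'C': ['k2', 'k3', 'k7', 'k8'],
    'D': ['k6', 'k9', 'k10'],
}
```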
The setup
Once I had everything ready, it was time to choose the dev environment. Since I have been using Lambda functions for quite some time with a lot of success, I decided to pursue that route. After some research, the best option appeared to be Python, as I know it's a popular language for data science and ML in general.
Eventually I came across the scikit-learn package. It had good reviews and seemed to do exactly what I was looking for. The added bonus was that it was built on NumPy, and it also meant I could use Pandas for dealing with large amounts of data (you will see why this is relevant shortly).
After some initial faffing (not least because of my limited working experience with Python), I managed to get everything set up and working. I've included a snippet of the code below, so you can get an understanding of what I was trying to achieve.
import json
import pathlib

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# CORS headers returned with every response
headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Allow-Methods': 'POST',
}


def predict_output(data, y, features, input_data):
    # Fit a decision tree on the selected feature columns of the training
    # data, then predict the output for each row of the incoming request.
    x = data[features]
    model = DecisionTreeRegressor(random_state=1)
    model.fit(x, y)
    return model.predict(input_data[features])


def get_input(results):
    # results is a list of dicts (one per respondent), so the plain
    # DataFrame constructor is the simplest way to build the frame.
    return pd.DataFrame(results)


def handler(event, context):
    body = event
    if 'body' in event:
        body = json.loads(event['body'])

    missing_params = {
        'statusCode': 400,
        'headers': headers,
        'body': json.dumps('Missing params'),
    }

    keys = {'k1', 'k2', 'k3', 'k4', 'k5', 'k6', 'k7', 'k8', 'k9', 'k10', 'k11'}
    if not body.get('results'):
        return missing_params
    for item in body['results']:
        if not keys.issubset(item):
            return missing_params

    # Load the training data bundled alongside the function code
    base = str(pathlib.Path().resolve())
    file_path = base + '/src/sample_data.csv'
    data = pd.read_csv(file_path)

    input_data = get_input(body['results'])

    # Each output is predicted from its own subset of the inputs
    keys1 = ['k3', 'k5']
    res1 = predict_output(data, data.A, keys1, input_data)

    keys2 = ['k1', 'k3', 'k4', 'k10', 'k11']
    res2 = predict_output(data, data.B, keys2, input_data)

    keys3 = ['k2', 'k3', 'k7', 'k8']
    res3 = predict_output(data, data.C, keys3, input_data)

    keys4 = ['k6', 'k9', 'k10']
    res4 = predict_output(data, data.D, keys4, input_data)

    output = []
    for i in range(len(body['results'])):
        output.append({
            # Cast numpy floats to plain floats so json.dumps can serialise them
            'A': float(res1[i]),
            'B': float(res2[i]),
            'C': float(res3[i]),
            'D': float(res4[i]),
        })

    return {
        'statusCode': 200,
        'headers': headers,
        'body': json.dumps({'output': output}),
    }
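For context, here is roughly how the handler can be exercised locally. The payload below is just an example of the shape it expects (a `results` list where every item contains all eleven keys; the values are taken from the first row of the sample CSV), and it assumes `sample_data.csv` is present at the path the function reads from:

```python
# Hypothetical local invocation; the values come from the sample CSV above
# and ./src/sample_data.csv is assumed to exist relative to the working directory.
sample_event = {
    'results': [
        {'k1': 0.2, 'k2': 0.4, 'k3': 0.4, 'k4': 0.6, 'k5': 0.4, 'k6': 0.5,
         'k7': 0.5, 'k8': 0.667, 'k9': 0.333, 'k10': 0.333, 'k11': 0.667},
    ]
}

response = handler(sample_event, None)
print(response['statusCode'])  # 200 if validation passes
print(response['body'])        # {"output": [{"A": ..., "B": ..., "C": ..., "D": ...}]}
```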
After a bit of tidying up, I was pretty happy. So, as I would usually do, I decided to push my changes up to Amplify and get this new function deployed in the staging environment. This is where things started to take a turn for the worse.
Issue after issue
Something I had never really considered too much before is the size limit imposed on Lambda functions. Because I was uploading using Amplify, I was restricted to a deployment package size of 250 MB (at the time of writing). Whilst this might sound like a lot, when you actually start to look at how large these Python packages are, you very quickly run out of space. When you add scikit-learn, NumPy and Pandas together, it came to more than 300 MB, so that was definitely never going to work.
Undeterred, the next thing I decided to do was look into Lambda layers. This seemed promising at first because AWS has a built-in layer for SciPy, which also includes NumPy. I then proceeded to spend an enormous amount of time searching for scikit-learn and Pandas layers, but never really managed to find a solution, so I quickly decided to abandon layers.
At this point, I was really beginning to wonder what I had got myself into, but in true sunk cost fallacy fashion, I felt I had put too much in to give up now. After more trawling I did manage to find some articles about compiling the packages from source, and I half-heartedly attempted to do this, but lo and behold, because I have an Apple Silicon Mac, I got stuck with a compilation issue. I tried downloading iTerm2 and running it in Rosetta mode, but this didn't work either, so I gave up on that approach in the end.
As one last-ditch attempt to put this to bed, I found a link to some paid Lambda layers and did consider them temporarily, but it didn't feel like the best approach, so I decided to go back to the drawing board.
The breakthrough
The next day, more determined than ever, I decided I needed to do something more radical. So radical, in fact, that I was about to throw away all the code I had already written (albeit not a lot) and start again in Node. I thought to myself: why am I trying to jump through all these hoops to use some giant libraries when I literally need a single algorithm (a decision tree) so that I can train my model and run it as a service?
So I went to GitHub and started to search for a library. Eventually I managed to find a couple (both are linked below) and was able to get one up and running. I did some testing, compared my results, and was quite impressed with the output.
It would be an understatement to say I was elated. I was so happy to have broken through this problem and, in the end, solved it with such a small package.
Conclusion
Although this was at times very challenging, I found the whole process very enjoyable. Since I'm not a data science engineer, this is not something I come across on a regular basis, but it taught me a lot about the limits of Lambda functions, how large these data science packages can be, and even how much easier it is (for me at least) to work in Node.
I think next time I would consider exploring a container-based architecture, or even a cloud-based solution such as SageMaker, but at least I managed to get something working, and that's the key takeaway here.
Useful links