Machine Learning in Sitecore – Is it real?

BINGO! It’s all the buzz, it’s cool. It’s machine learning. But is there real-world use for machine learning in organizations that aren’t an Amazon or a Microsoft?  Honestly, I don’t know, but the potential looks amazing, particularly for organizations that have a lot of data to leverage.

The organization I work for is one such company.  In fact, one of the most difficult issues we faced was deciding what data we could (and should) use.   We devised a Proof of Concept (PoC) that would explore the capabilities of integrating machine learning with Sitecore XP and give us measurable outcomes as to its success.  With Cortex being announced at Symposium last year, the promise of a Sitecore-delivered solution is on the horizon; however, there is little detail on the specifics.  Running our PoC gives us a precursor and hopefully contributes to the business case for implementing Cortex or another solution in future.

Our overall goals were to:

  • Prove we can integrate some Machine Learning technology into Sitecore
  • Create metrics that will allow us to measure, optimise and verify outcomes
  • Get it to market quickly as a PoC
  • Ensure data security given we were dealing with sensitive information

As of right now, our PoC has been running in production with real users for a few months.  We’re still learning and iterating to optimise the outcomes.  Unfortunately I’m limited by a few factors in sharing specific code examples publicly, but this is how we approached it.

Solution Design

We put together a small panel from internal and partner team members to quickly design a solution that would meet our goals outlined above.  Consulting with business and technical stakeholders, we mapped out what we thought was the best path forward.

The concept was to create a recommendation engine that could be integrated into Sitecore, allowing authors to add a component to the page displaying personalised content.  The content would be a recommendation featuring products that the ML model predicted might be of interest to the end user.

There were two main data sets available to us to build a “profile” of existing users that we could use to train and test the ML models.

First, basic demographic information, e.g. gender, age band and postcode (in Australia these are at a suburb/regional level).  These data points, combined with some overarching categorizations provided by the Australian Bureau of Statistics, gave us a solid but well-anonymised profile of the user.  This data set contained well over a million subjects to train and test models with.

Second was the product-holding data, which mapped out which users currently held which products.  To simplify the project we ran this on a subset of 15 products, all of which are subscription-style products rather than physical goods.

Without delving too far into the details (next sections!), we planned to implement a flow that looked a little like this:


  1. Extract data (albeit manually) from our Enterprise data warehouse.
  2. Upload to Azure ML Studio
  3. Run training and testing against the ML model (to determine accuracy)
  4. Decide on a model that offered best results through statistical relevance and human sanity checks
  5. Expose the results
  6. Create a Sitecore rendering that could consume the results service
  7. Run content tests against known control variants
  8. Re-assess and optimise test content on a regular basis.

Machine Learning with Azure ML Studio

Up front, I am by no means a Data Scientist.  As it stands, this project absolutely needed expertise in this area, and we looked to our partner to provide those insights.  We particularly needed extra expertise in:

  • Preparing data sets
  • Algorithm selection
  • Evaluation of results
  • General data manipulation techniques

While Azure ML Studio does make it easy for non-Data Scientists to get started, I found there was much to learn and gained some valuable insights into what (not) to do in certain situations.  That said, Microsoft does offer some great documentation to get started.

Hold on tight… here we go.  When preparing the data sets we wanted to ensure that the distribution of the training set was reflective of the actual data set.  It stands to reason that a training data set that closely represents the characteristics of the actual data will yield more accurate results.  Running a t-test on the data sets gives an indication of the means and variances, plus a p-value indicating how likely a difference that large in the means would be if both sets were drawn from the same underlying population (i.e. is your data reasonably consistent, and does the training sample fit the full set well?).  This was run across the demographic distributions as well as individual products, looking for significant variations.

What I did learn here is that data set preparation involves a lot of trial and error, and a lot of copying and pasting (note to self: order new C and V keys).  Rinse and repeat: create the data sets, re-run the tests and compare.  Eventually we landed on a training data set we were confident had a distribution reflective of the full set.
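To give a feel for the kind of check involved (this isn’t our actual PoC code, and the file and column names are made up), a minimal sketch of comparing a candidate training sample against the full extract might look something like this:

```python
# Minimal sketch of the distribution check described above (illustrative only).
# Assumes the full extract and the candidate training sample are CSVs with
# hypothetical numeric columns; file and column names are placeholders.
import pandas as pd
from scipy import stats

full_set = pd.read_csv("full_extract.csv")
training_set = pd.read_csv("training_candidate.csv")

for column in ["age_band_code", "product_count"]:  # hypothetical numeric columns
    t_stat, p_value = stats.ttest_ind(
        full_set[column],
        training_set[column],
        equal_var=False,  # Welch's t-test; doesn't assume equal variances
    )
    # A high p-value means there is no evidence the sample's mean differs from
    # the full set's, i.e. the training sample looks representative on this column.
    print(f"{column}: t={t_stat:.3f}, p={p_value:.3f}")
```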

While Azure ML Studio users can write and maintain their own algorithms, there is also a bunch available out of the box, which may put you just a drag and drop away from a successful model.  We assessed a number of algorithms, but for our purposes, and given the tight time frame, we settled on the “Matchbox recommender”, which is commonly used in recommendation engines.

Microsoft has developed a large-scale recommender system based on a probabilistic model (Bayesian) called Matchbox. This model can learn about a user’s preferences through observations made on how they rate items, such as movies, content, or other products. Based on those observations, it recommends new items to the users when requested.

The inputs would be the demographic data as the “users”, the product holdings as the “ratings” against the product “items”, and of course some product metadata for each line.  Using these inputs, we were able to create a trained model upon which we could perform predictive experiments.  Hooray!
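As a rough illustration of the shape of those three inputs (not our real data; all IDs, columns and values below are invented), the rating triples, user features and item features could be prepared along these lines:

```python
# Rough sketch of the three inputs a Matchbox-style recommender expects:
# rating triples (user, item, rating), user features and item features.
# All column names and values here are hypothetical.
import pandas as pd

# Demographics become the "user features" input.
user_features = pd.DataFrame({
    "user_id": [1, 2],
    "gender": ["F", "M"],
    "age_band": ["35-44", "25-34"],
    "postcode": ["3000", "2000"],
})

# Item features: one row of metadata per product.
item_features = pd.DataFrame({
    "product_id": ["prod_a", "prod_b"],
    "category": ["subscription", "subscription"],
})

# Product holdings become implicit "ratings": holding a product = rating of 1.
holdings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "product_id": ["prod_a", "prod_b", "prod_a"],
})
ratings = holdings.assign(rating=1)[["user_id", "product_id", "rating"]]

ratings.to_csv("ratings.csv", index=False)            # rating triples input
user_features.to_csv("user_features.csv", index=False)
item_features.to_csv("item_features.csv", index=False)
```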

Matchbox recommender training flow

Well, almost hooray.  We still needed to confirm that our results were a) statistically accurate and b) able to pass a human “sniff test”.  For a), luckily ML Studio has an “Evaluate Recommender” module that you can feed the scored results (from your predictive model) and a test data set for comparison.  Once run, you can right-click the output port of the module to visualise the evaluation results, which include a Normalized Discounted Cumulative Gain (NDCG) value, a measure of ranking quality.

The evaluation metric Normalized Discounted Cumulative Gain (NDCG) is estimated from the ground truth ratings given in the test set. Its value ranges from 0.0 to 1.0, where 1.0 represents the most ideal ranking of the entities.

So, get close to 1… and you’re good to go!
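For the curious, here’s a tiny worked example of what NDCG actually measures. ML Studio calculates this for you, so this is purely illustrative:

```python
# Worked sketch of the NDCG measure reported by the "Evaluate Recommender" module.
import math

def dcg(relevances):
    # Discounted cumulative gain: relevance discounted by log2 of the rank position.
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(recommended_relevances):
    # Normalise by the DCG of the ideal (best possible) ordering.
    ideal_dcg = dcg(sorted(recommended_relevances, reverse=True))
    return dcg(recommended_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Ground-truth relevance of 5 recommended products for one user, in the order
# the model ranked them (1 = the user actually holds/wants that product).
print(ndcg([1, 0, 1, 0, 0]))  # ~0.92 - decent, but not the ideal ranking
```

A perfect ranking (all relevant items first) scores 1.0; pushing relevant items further down the list discounts them logarithmically.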

Predictive experiment flow

During development we explored two ways of exposing the data to the service developed in Sitecore.

  • Creating a secure web service endpoint in ML Studio that would accept the input parameters, then respond with a scored result set.  Unfortunately we found these requests to be slower than expected.
  • Using an input of the full data set, outputting results for all users that could be imported into Sitecore.  This would give us fast access to “snapshots” of recommendations, but with some manual overhead and the possibility of stale data if it wasn’t regularly updated.

Both had pros and cons, but given this was a time-boxed Proof of Concept we wanted to ensure there was no performance impact, so we implemented the latter option.
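For reference, a call to the first option (an ML Studio request/response endpoint) looks roughly like the sketch below. The URL, API key and column names are placeholders, not our actual service:

```python
# Rough sketch of calling an ML Studio request/response endpoint (illustrative only;
# workspace/service IDs, API key and column names below are placeholders).
import json
import urllib.request

URL = "https://ussouthcentral.services.azureml.net/workspaces/<workspace>/services/<service>/execute?api-version=2.0&details=true"
API_KEY = "<api key from the ML Studio web service page>"

payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["user_id"],
            "Values": [["12345"]],  # the user we want recommendations for
        }
    },
    "GlobalParameters": {},
}

request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
)
with urllib.request.urlopen(request) as response:
    scored = json.loads(response.read())
    # The ranked product recommendations come back in the output table.
    print(scored["Results"]["output1"])
```

In the end we went with the batch export instead, so in production the Sitecore service read from the imported snapshot rather than calling an endpoint per request.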

Measuring success

Being able to measure the success of the initiative was one of the core goals of the PoC.  We needed unequivocal evidence as to whether or not the strategy improved key metrics on the site.  For this we wanted to ensure a few things:

  • We had control data sets in place, so we have a baseline for comparison.
  • We were measuring goals and engagement values relevant to the exercise
  • Metric indicators for engagement were recorded in different ways (e.g. time on site in Google Analytics and Sitecore’s trailing visit engagement value).  We really only had one shot to run this PoC, so we wanted to cover as many bases as possible and compare trends across the board.

To ensure that we had a baseline and were able to compare the ML results against the norm, we implemented a recommendation rendering with similar layout and design but three separate data sources:

  1. The ML results in a personalised context to the user viewing the page
  2. An array of products curated in the CMS by content authors
  3. A random selection of products

We could then use the content testing features in Sitecore XP to deliver an A/B/n test on selected placements.  This allows for comparative analysis on Sitecore goals, the path analyser and engagement value.  We also enabled some dynamic metric gathering in Google Analytics to back up the data collected in Sitecore and give us some very specific page-level stats.
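Sitecore’s content testing and Google Analytics do the heavy lifting here, but as a purely illustrative sketch of the kind of comparison we were after (the numbers below are invented), checking whether conversion rates genuinely differ across the three variants might look like this:

```python
# Illustrative only - Sitecore content testing / GA perform this analysis for you.
# The conversion counts below are invented.
from scipy.stats import chi2_contingency

# Goal conversions vs non-conversions for each variant over the same period.
observed = [
    [120, 880],  # ML-personalised recommendations
    [95, 905],   # author-curated products (control)
    [80, 920],   # random products (control)
]
chi2, p_value, dof, expected = chi2_contingency(observed)
# A small p-value suggests the difference in conversion rates between variants
# is unlikely to be down to chance alone.
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
```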

Configuring A/B/n content tests using different data sources


A/B/n tests were scheduled, with some checkpoints to stop tests, analyse, optimise content and test hypotheses that we thought might enhance the user experience based on the data.  We were looking at things like adjusting CTAs, imagery, layout and supporting copy.  In all cases it was imperative that all variants had similar changes applied at the same time, to keep from skewing any test results.

Integration with Sitecore

Now, with a scored data set to work with and a clear idea of the other data sets required in the recommendations, we needed to map all of this into a format content authors could manage and optimise (as above).  We already had a product tile component that content authors were able to configure to display any number of curated products.  To keep things familiar and nicely componentised, we were able to quickly extend the rendering to use a different “service” for retrieving data depending on data source settings.  We added settings to the Sitecore templates that flagged which data source “type” was being used and which data service to use.

This also allowed us to implement custom business logic in the data service that retrieved the ML results.  Just because a result may be statistically relevant doesn’t necessarily mean the business would want to encourage some purchases (e.g. recommending a subscription of lesser value than one the user already had).   We had to account for these situations, given this was going to have real-world business impact.
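I can’t share the actual service code, but the business rule boils down to something like the sketch below (written in Python for brevity; the real logic lives in the Sitecore data service, and all names and values here are hypothetical):

```python
# Language-agnostic sketch of the business rule described above - illustrative only.
def filter_recommendations(scored_products, user_holdings):
    """Drop recommendations the business wouldn't want to promote,
    e.g. subscriptions of lesser value than one the user already holds."""
    max_held_value = max((p["annual_value"] for p in user_holdings), default=0)
    already_held = {p["product_id"] for p in user_holdings}
    return [
        product
        for product in scored_products
        if product["product_id"] not in already_held
        and product["annual_value"] >= max_held_value
    ]

recommended = filter_recommendations(
    scored_products=[
        {"product_id": "prod_a", "annual_value": 120},
        {"product_id": "prod_b", "annual_value": 40},
    ],
    user_holdings=[{"product_id": "prod_c", "annual_value": 60}],
)
print(recommended)  # only prod_a survives the value rule
```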

This approach allowed us to leverage existing knowledge as content authors could set up everything including the tests in Experience Editor, while still allowing for the flexibility required to meet the goals and custom business rules.

A/B/n Content tests in Experience Editor

Results

So this is the crux of it, eh?  Did it work?  Well, sorry, but we can’t draw any conclusions one way or the other just yet.  It’s too early.  We will continue to analyse and optimise the results, after which I’m sure our analytics team will delve into them further to identify trends and perhaps things we could have done better.

Once the PoC is complete we’ll be tearing it down (Noooooo! But that is the nature of a PoC).  Depending on the outcomes there may be a fully-fledged version, or perhaps we’ll have more information on other solutions that may be better suited (I’m looking at you, Cortex).  We just don’t know until that time comes ¯\_(ツ)_/¯.

That said, using ML for this sort of marketing tooling looks very promising and we can definitely cross off the goals we set out to achieve.

Update:

Great success! The PoC period was extended to gather some more results and confirm initial indicators.  The overall results were phenomenal.  So much so that the project is now the subject of a case study published by Sitecore.  A big shout out to all that helped on the project.