Text Analysis Used by Kubrick Group to Drive Website Structure

Text analysis, while not a new topic, has seen a dramatic rise in popularity and usefulness in recent years, thanks to growth in the computational power and methods available. Kubrick Group explore one possible use case: applying text analysis to restructure files & documents on a public website.

“For this project we used numerous Python libraries (Pandas, Scrapy, Natural Language Toolkit, Scikit-learn, etc.), Neo4j graph mapping, and Google Analytics to revolutionise the way data is used and viewed on the client's website. Personally, I found the project hugely beneficial as a way to consolidate all the knowledge I’ve gained throughout the technical training and also to simulate working in an agile work environment.” Jordan Calcutt – Kubrick Data Scientist

Key Achievements

  • Detailed mapping of all data sources & locations using web scraping
  • PDF text parsing using Python
  • Document clustering using unsupervised machine learning
  • Document hierarchy using text analysis & Google Analytics

The challenge

Through years of build-up, documents & data sources on the website have become both disorganised and disparate, with many files not found under their logical directory.

This not only harms user experience, but also reduces the exposure and visibility of documents & information available on the website.

While this task can be done manually on a small scale, it quickly becomes a challenge on a site that holds over 3000 documents across more than 1900 different webpages.

This work first involved building website and directory crawlers using the Python library Scrapy, crawling through every webpage connected to the insurer’s directory and logging the location of each page.
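The crawl logic itself can be sketched without Scrapy. The example below is a minimal, standard-library stand-in rather than the project's crawler: it walks a hypothetical in-memory site (the `SITE` dictionary and its example.com URLs are invented for illustration), extracts links with `html.parser`, and logs every page or file together with the page that linked to it.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory site standing in for real HTTP responses.
SITE = {
    "https://example.com/": '<a href="/claims/">Claims</a> <a href="/about/">About</a>',
    "https://example.com/claims/": '<a href="/docs/claim-form.pdf">Form</a> <a href="/">Home</a>',
    "https://example.com/about/": '<a href="/docs/annual-report.pdf">Report</a>',
}


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl; returns a (parent, child) edge for every link found."""
    seen = {start_url}
    queue = deque([start_url])
    edges = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # a file such as a PDF, or a URL outside the site
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            child = urljoin(url, href)  # resolve relative links
            edges.append((url, child))
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return edges


edges = crawl("https://example.com/", SITE.get)
pdfs = sorted({child for _, child in edges if child.endswith(".pdf")})
```

Recording the parent alongside each child is the design choice that matters here: it is what later allows page and file locations to be related to one another.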

From this, we can relate webpage and file locations to each other and map those relationships out using the graph database Neo4j. This results in a very thorough and visual understanding of the current website structure, as shown in Figure 1.

Fig 1: Map of entire website, built using Neo4j.
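As a rough illustration of how page relationships might be prepared for a graph database, the sketch below turns (parent, child) link pairs into parameter rows for a Cypher `MERGE` query. The `Page` label, the `LINKS_TO` relationship type, the `url` property and the example URLs are all assumptions made for this example, not the schema used in the project.

```python
# Hypothetical schema: (:Page {url})-[:LINKS_TO]->(:Page {url}).
# MERGE creates a node or relationship only if it does not already exist.
MERGE_EDGE = (
    "MERGE (parent:Page {url: $parent}) "
    "MERGE (child:Page {url: $child}) "
    "MERGE (parent)-[:LINKS_TO]->(child)"
)


def edge_parameters(edges):
    """Turn (parent, child) pairs into de-duplicated parameter rows for MERGE_EDGE."""
    return [{"parent": parent, "child": child} for parent, child in sorted(set(edges))]


# Invented example links; a real run would use the crawler's output.
example_edges = [
    ("https://example.com/", "https://example.com/claims/"),
    ("https://example.com/claims/", "https://example.com/docs/claim-form.pdf"),
    ("https://example.com/", "https://example.com/claims/"),  # duplicate link
]
rows = edge_parameters(example_edges)
```

Each row can then be sent to Neo4j together with `MERGE_EDGE` through a database driver; using query parameters rather than string concatenation avoids quoting problems with unusual URLs.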

The next step was to create an intelligent document sorting system, which can cluster and file documents based on the content inside, using text analysis methods.

First, using the PDF locations, the Python library PDFMiner was used to parse the PDFs that were found (over 2500 of them), pulling out all the text information within.

An algorithm was built using Term Frequency-Inverse Document Frequency (TF-IDF) weighting (this text analysis method is available via Scikit-learn), which scores each key word by how often it appears in a document relative to how common it is across all documents, so that documents sharing distinctive terms can be grouped together. This is shown in Figure 2.

Fig 2: Term Frequency-Inverse Document Frequency example.
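To make the weighting concrete, here is a minimal sketch of the textbook formula, weight = tf × ln(N / df), in plain Python. It is not the project's code, and the three sample documents are invented. Scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalisation by default, so its numbers will differ from the ones produced here.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * ln(N / df)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Invented sample documents.
docs = [
    "motor insurance claim form",
    "home insurance claim form",
    "annual report and accounts",
]
weights = tf_idf(docs)
```

In the first document, "motor" receives a higher weight than "insurance" because it appears in only one of the three documents, which is exactly the behaviour that lets distinctive terms drive the grouping.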

This was improved using Latent Dirichlet Allocation (LDA), a generative statistical model first proposed in 2003, which hypothesises that a document is made up of a number of topics and each topic is made up of a number of words. This concept is visualised in Figure 3.

Fig 3: A document is composed of a number of topics (colours) and each topic is made up of a number of words, distributed according to a Dirichlet distribution.
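The generative story behind LDA can be sketched in a few lines: draw a topic mixture for the document from a Dirichlet distribution, then for each word pick a topic from that mixture and a word from that topic. The example below only illustrates this forward process; it does not fit a model. The two topics, their word probabilities and the `alpha` values are invented, and in full LDA the topic-word distributions are themselves drawn from a Dirichlet prior rather than fixed.

```python
import random

# Hypothetical, hand-written topics: each maps words to probabilities.
TOPICS = [
    {"claim": 0.5, "policy": 0.3, "premium": 0.2},     # an "insurance" topic
    {"profit": 0.4, "revenue": 0.4, "dividend": 0.2},  # a "finance" topic
]


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate `length` words following LDA's generative process."""
    theta = sample_dirichlet(alpha, rng)  # this document's topic mixture
    words = []
    for _ in range(length):
        # Choose a topic for this word position, then a word from that topic.
        topic = rng.choices(range(len(topics)), weights=theta)[0]
        vocabulary = list(topics[topic])
        probabilities = list(topics[topic].values())
        words.append(rng.choices(vocabulary, weights=probabilities)[0])
    return theta, words


rng = random.Random(0)  # seeded so the sketch is repeatable
theta, words = generate_document(TOPICS, alpha=[0.5, 0.5], length=20, rng=rng)
```

Fitting LDA runs this story in reverse: given only the words, it infers the topic mixtures and topic-word distributions most likely to have produced them.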

Once weighted and topic-defined, the documents could then be grouped using K-means clustering, with the cosine distance between TF-IDF vectors used to assign each document to one of a pre-defined number of clusters. An example with 15 clusters is shown in Figure 4.

Fig 4: Documents and their clusters. Note that the x- and y-axes do not represent absolute measures, but are projections that allow document distance to be visualised in 2-D space; see multidimensional scaling (MDS) for more information.
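One way to combine K-means with cosine distance is so-called spherical K-means: scale every vector to unit length, assign each document to the centroid with the smallest cosine distance, and re-normalise the centroids after each update. The sketch below is a small pure-Python illustration under those assumptions, with invented vectors and a deliberately naive initialisation; it is not the project's implementation. Scikit-learn's `KMeans` minimises Euclidean distance, but on unit-length vectors squared Euclidean distance is exactly twice the cosine distance, so the two approaches are closely related.

```python
import math


def normalise(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine_distance(a, b):
    """1 - cosine similarity, for vectors already scaled to unit length."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def kmeans_cosine(vectors, k, iterations=20):
    """Cluster vectors into k groups by cosine distance; returns a label per vector."""
    points = [normalise(v) for v in vectors]
    centroids = [list(p) for p in points[:k]]  # naive initialisation: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by cosine distance.
        labels = [
            min(range(k), key=lambda c: cosine_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: mean of the members, scaled back to unit length.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean = [sum(column) / len(members) for column in zip(*members)]
                centroids[c] = normalise(mean)
    return labels


# Invented TF-IDF-like vectors: documents 0, 2, 4 share terms, as do 1 and 3.
vectors = [
    [3, 1, 0, 0],
    [0, 0, 2, 1],
    [2, 1, 0, 0],
    [0, 0, 1, 3],
    [1, 2, 0, 0],
]
labels = kmeans_cosine(vectors, k=2)
```

In practice the starting centroids are usually chosen more carefully and the algorithm is run several times, since K-means can settle on different clusterings depending on where it starts.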

Once these clusters are mapped back to their original file locations on the website, it quickly becomes clear that while most areas are well placed in terms of topic, other areas need to be restructured to make more logical sense. Some of these areas are highlighted in Figure 5.

Fig 5: K-means clustered documents, based on topic, overlaid on their original file locations. The majority of the website is clustered logically; however, there are some discrepancies, as noted.

This still does not give a fully prescriptive method for re-categorising documents and folders, so a document hierarchy based on document topic was created. This involves using Ward’s method to create sub-clusters. The results of this are shown in Figure 6.

Fig 6: Example of a document hierarchy created using Ward’s method. Two main clusters have been defined, but the number can be user-specified. Document names at the bottom have been omitted for data privacy reasons.
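Ward's method builds the hierarchy bottom-up: every document starts as its own cluster, and at each step the two clusters whose merger adds the least to the within-cluster sum of squares are joined. The sketch below is a naive pure-Python illustration on four invented 2-D points, not the project's code; libraries such as SciPy provide optimised implementations (for example `scipy.cluster.hierarchy.linkage` with `method="ward"`), whose reported distances are scaled differently from the raw cost used here.

```python
def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b were merged."""
    size_a, size_b = len(a), len(b)
    centroid_a = [sum(column) / size_a for column in zip(*a)]
    centroid_b = [sum(column) / size_b for column in zip(*b)]
    squared_distance = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
    return size_a * size_b / (size_a + size_b) * squared_distance


def ward_hierarchy(points):
    """Return the merge history as (members_a, members_b, cost), using point indices."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters that is cheapest to merge.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(
                    [points[m] for m in clusters[i]],
                    [points[m] for m in clusters[j]],
                )
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        merges.append((clusters[i], clusters[j], cost))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


# Invented example: two tight pairs of points that sit far apart.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
merges = ward_hierarchy(points)
```

Reading the merge history from the last entry backwards gives the top-down folder structure: the final merge is the split into two main clusters, and earlier merges are the sub-clusters within them.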

Google Analytics

By pairing this information with page view data, more insightful conclusions could be drawn about choke points (areas with low traffic that lead to high traffic) and loss points (points in the site where users get lost or exit after not finding what they are looking for), to be investigated by the marketing team. An example of choke point mapping is shown in Figure 7.

Fig 7: One of the choke maps produced, which highlights pages with more views than their “parent” page; darker shading represents a bigger difference in views.
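The comparison behind a choke map can be sketched as follows, assuming page views have already been exported per page and each page's parent is known from the site structure. The page paths and view counts below are invented, and the function is an illustration of the idea rather than the analysis used in the project.

```python
def choke_points(parent_of, views):
    """Return (page, parent, view_gap) for pages with more views than their parent.

    Rows are sorted with the biggest gap first.
    """
    rows = []
    for page, parent in parent_of.items():
        gap = views.get(page, 0) - views.get(parent, 0)
        if gap > 0:
            rows.append((page, parent, gap))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Invented site structure and page-view counts.
parent_of = {
    "/claims/": "/",
    "/claims/forms/": "/claims/",
    "/about/": "/",
    "/about/reports/": "/about/",
}
views = {"/": 5000, "/claims/": 300, "/claims/forms/": 2100, "/about/": 900, "/about/reports/": 950}

chokes = choke_points(parent_of, views)
```

Here `/claims/` would stand out: it receives far fewer views than the forms page beneath it, suggesting users reach the forms by another route and that the parent page may be worth investigating.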


This analysis identified key areas of the website to focus on for improvement, in addition to delivering a fuller picture of the current file environment.

From this, the marketing team are able to focus their efforts on specific areas, improving customer experience and making important information easier to find.

This exploratory piece of work also serves a more general purpose beyond this specific use case: topic modelling & text analysis in the insurance sector can be used to identify document patterns, detect fraud, minimise payouts and process documents faster than any human can. This innovative use of data is what Big Data is all about: using what you have intelligently, rather than simply collecting masses of numbers.

This blog only highlights a select number of the methods used in this project (mainly the cool ones!). For a more in-depth explanation of the methods and processes used in text analysis projects, please contact Kubrick Group at alikokaz@kubrickgroup.com

Fig 2: Image Source

Fig 3: Image Source

by Ali Kokaz

Posted on August 11, 2017