Wednesday 6 November 2019

Count word frequency within text

The following Google Apps Script takes a chunk of text pasted into a Google Sheet and counts how many times each word appears within that text, displaying the results in another sheet as a list ranked from most to least frequent.

The script also makes use of Stopwords: words that are ignored in the main body of text and not counted. Commonly used English words such as 'they', 'are', 'a' and 'the' would otherwise sit unnecessarily at the top of the results list each time. The list of Stopwords can be adjusted if you want to add or remove some.
Screenshot of example word count results
getSpreadsheetData.gs
This function gets all of the initial data from the Google Sheet, including the text we want to interrogate. In the example the text is split over multiple rows, so we use 'toString' to turn it into one continuous string. It then calls the next function to interrogate the data.
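As a rough sketch of that first step (not the script's actual code), the function below flattens rows of cell values into one string. It uses 'join' with a space rather than 'toString', so that words on neighbouring rows do not run together. The name joinRowsToText, the sample rows and the sheet name 'Text' in the comment are all assumptions made for illustration.

```javascript
/**
 * Join a 2D array of cell values (the shape that Range.getValues() returns)
 * into one continuous string, collapsing any extra whitespace.
 * Hypothetical helper: the name is not taken from the original script.
 */
function joinRowsToText(values) {
  return values
    .map(function (row) { return row.join(' '); }) // cells within a row
    .join(' ')                                     // rows within the sheet
    .replace(/\s+/g, ' ')                          // collapse runs of whitespace
    .trim();
}

// Sample data standing in for the sheet. In Apps Script the rows might come
// from something like (sheet name 'Text' is an assumption):
//   SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Text').getDataRange().getValues()
var sampleRows = [['The cat sat'], ['on the mat.'], ['']];
var joinedText = joinRowsToText(sampleRows);
// joinedText is 'The cat sat on the mat.'
```

Empty rows simply disappear, because the whitespace they leave behind is collapsed and trimmed.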

queryText.gs
This function goes through the text and counts each word in turn, once it has cleared away some punctuation and ignored Stopwords. It makes use of a regular expression (regex) to clean up any unwanted punctuation which may interfere with our counter.

Regex breakdown -  /[a-zA-Z'’-]+/  - https://regex101.com/
  • /  /  start/end of regex.
  • [ ]  match any character in the set.
  • a-z  a single character in the range between a and z (case sensitive).
  • A-Z  a single character in the range between A and Z (case sensitive).
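  • '’-  matches a single character in the list (in this example we have 2 types of apostrophe and a hyphen; the hyphen is treated literally because it comes last in the set).
  • +  match one or more characters from the set in a row, so each whole word is picked up as one item.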
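To show the regex in action, here is a minimal sketch of a word counter built around it. It is an illustration under a few assumptions rather than the script itself: the 'g' flag is added so that every match is returned, the text is lower-cased so 'The' and 'the' count as the same word, and the name countWords and the short Stopwords list are made up for the example.

```javascript
/**
 * Count how often each word appears in the text, skipping Stopwords.
 * Hypothetical helper: name and lower-casing are assumptions.
 */
function countWords(text, stopwords) {
  // Same character set as above, plus the global flag to find every word.
  var words = text.toLowerCase().match(/[a-zA-Z'’-]+/g) || [];
  // Object.create(null) avoids clashes with built-in keys like 'constructor'.
  var counts = Object.create(null);
  words.forEach(function (word) {
    if (stopwords.indexOf(word) !== -1) return; // ignore Stopwords
    counts[word] = (counts[word] || 0) + 1;
  });
  return counts;
}

var exampleStopwords = ['the', 'on'];
var exampleCounts = countWords("The cat sat on the mat. The cat's mat!", exampleStopwords);
// exampleCounts has: cat 1, sat 1, mat 2, cat's 1
```

The full stop and exclamation mark never make it into the results because they are not in the character set, while the apostrophe in "cat's" is kept as part of the word.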
sortArray.gs
This function sorts the overall array before the results are pasted into the Google Sheet. It means a list of the words in descending order of frequency will be produced, as detailed in this blog post, which explains the process a bit more.
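One possible way to do the sort is sketched below; it is not necessarily how sortArray.gs does it. The name sortByCountDescending and the alphabetical tie-break for words with the same count are assumptions for the example.

```javascript
/**
 * Turn a { word: count } object into [word, count] rows,
 * sorted with the highest count first.
 * Hypothetical helper: the tie-break rule is an assumption.
 */
function sortByCountDescending(counts) {
  var rows = Object.keys(counts).map(function (word) {
    return [word, counts[word]];
  });
  rows.sort(function (a, b) {
    // Higher count first; equal counts fall back to alphabetical order.
    return b[1] - a[1] || a[0].localeCompare(b[0]);
  });
  return rows;
}

var sortedRows = sortByCountDescending({ cat: 1, mat: 2, sat: 1 });
// sortedRows is [['mat', 2], ['cat', 1], ['sat', 1]]
```

The result is a 2D array with one row per word, which is the shape Range.setValues() expects, so in Apps Script it could be written out with something like sheet.getRange(1, 1, sortedRows.length, 2).setValues(sortedRows).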

Text query word frequency
