Tuesday, 5 May 2020

Bulk extract text from Google Docs for analysis

The following Google Apps Script is designed to go through a number of Google Docs and extract specific text from the body that represents an answer to a question. The tool was developed as a result of a Researcher needing to analyse hundreds of files that were each an answer to various survey questions. There were two specific sections that needed to be targeted and the content collated into a spreadsheet so further analysis could be performed.
Screenshot of Doc Analysis results in spreadsheet
Screenshot of Doc Analysis results in spreadsheet

The results from each Google Doc is added as a row into a Google Sheet, along with a link to the relevant file and its name. As the script loops through each file it calls a Function called infoGrabber that performs the task of extracting only the relevant text.

Once we get the Doc body as a single string we specify the piece of unique text that appears just before the answer to the question that we want:
var str = body.getText().toString();
var lookFor1 = "complete this section)?";
Next we create a start-value that looks for the position of the above text within the Doc body and then adds its length so it will start at the next character along - which will be the beginning of our answer text that we want to extract:
var startOffset1 = str.indexOf(lookFor1) + lookFor1.length;
We also need a finish-value which will be the start of the next chunk of unwanted text in the Doc (the next question in this example):
var endOffset1 = str.indexOf("EVERYTHING BELOW");
Now we can get all of the text inbetween the above two values and capture this within our spreadsheet:
section1 = str.substring( startOffset1, endOffset1);
This is repeated for the second question of text we want to capture.

Bulk extract text from Google Docs for analysis

No comments:

Post a comment