The Gift of Script: Bulk extract text from Google Docs for analysis

Tuesday, 5 May 2020

Bulk extract text from Google Docs for analysis

The following Google Apps Script goes through a number of Google Docs and extracts specific text from the body of each one that represents an answer to a question. The tool was developed for a researcher who needed to analyse hundreds of files, each of which was an answer to various survey questions. Two specific sections needed to be targeted and their content collated into a spreadsheet so that further analysis could be performed.

Screenshot of Doc Analysis results in spreadsheet

The results from each Google Doc are added as a row in a Google Sheet, along with the file's name and a link to it. As the script loops through each file it calls a function named infoGrabber, which extracts only the relevant text.
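To make the row-per-file idea concrete, here is a minimal sketch in plain JavaScript. It is not the script from the download: buildRow, collectRows and the sample data are hypothetical, the extract argument only stands in for infoGrabber (whose real signature is not shown here), and the column order is an assumption.

```javascript
// Sketch only: `extract` stands in for the post's infoGrabber, whose real
// signature is not shown in the post. The column order is an assumption.
function buildRow(fileName, fileUrl, bodyText, extract) {
  var sections = extract(bodyText); // e.g. ["answer 1", "answer 2"]
  return [fileName, fileUrl].concat(sections);
}

// Hypothetical driver that works on plain objects instead of Drive files.
function collectRows(docs, extract) {
  var rows = [];
  for (var i = 0; i < docs.length; i++) {
    var doc = docs[i];
    rows.push(buildRow(doc.name, doc.url, doc.text, extract));
  }
  return rows;
}

// Example with made-up data and a trivial stand-in extractor.
var demoDocs = [
  { name: "Response A", url: "https://example.com/a", text: "alpha|beta" },
  { name: "Response B", url: "https://example.com/b", text: "gamma|delta" }
];
var demoRows = collectRows(demoDocs, function (text) {
  return text.split("|");
});
```

In Apps Script itself, the name and link would typically come from file.getName() and file.getUrl() on each Drive file, the text from DocumentApp.openById(file.getId()).getBody().getText(), and each row could be written with sheet.appendRow(row), which adds the row below the existing data in the sheet.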

Once we get the Doc body as a single string we specify the piece of unique text that appears just before the answer to the question that we want:

var str = body.getText().toString();
var lookFor1 = "complete this section)?";

Next we create a start value: the position of the above text within the Doc body plus its length, so that it points at the next character along, which is the beginning of the answer text we want to extract:

var startOffset1 = str.indexOf(lookFor1) + lookFor1.length;

We also need a finish-value which will be the start of the next chunk of unwanted text in the Doc (the next question in this example):

var endOffset1 = str.indexOf("EVERYTHING BELOW");

Now we can get all of the text between the above two values and capture this within our spreadsheet:

section1 = str.substring( startOffset1, endOffset1);
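The steps above can be pulled together into one small helper. This is a sketch rather than the code from the download: the name extractBetween, the trimming of whitespace, and the handling of missing markers are all assumptions added for illustration.

```javascript
// Hypothetical helper: returns the text between two marker strings.
function extractBetween(str, startMarker, endMarker) {
  var startPos = str.indexOf(startMarker);
  if (startPos === -1) {
    return ""; // start marker not found, so there is nothing to extract
  }
  // Move past the marker so the result begins with the answer text.
  var startOffset = startPos + startMarker.length;
  // Search for the end marker only after the start of the answer.
  var endOffset = str.indexOf(endMarker, startOffset);
  if (endOffset === -1) {
    endOffset = str.length; // no end marker: take the rest of the text
  }
  return str.substring(startOffset, endOffset).trim();
}

// Made-up sample text using the marker strings from this post.
var sample = "Q1 (complete this section)? My answer here.\nEVERYTHING BELOW is ignored";
var section1 = extractBetween(sample, "complete this section)?", "EVERYTHING BELOW");
// section1 is "My answer here."
```

The checks for -1 matter because indexOf returns -1 when the text is not found. Without them a Doc that lacks a marker would still produce a value, just not the intended one.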

The same steps are repeated for the second section of text we want to capture.
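One way to avoid copying the code for each section is to list the marker pairs and loop over them. The sketch below assumes this approach. The first pair uses the marker strings shown above, while the second pair ("Second question?" and "END OF FORM"), the sample text and the names SECTION_MARKERS and grabSections are made up for illustration.

```javascript
// Marker pairs, one per section to capture. The second pair is hypothetical.
var SECTION_MARKERS = [
  { start: "complete this section)?", end: "EVERYTHING BELOW" },
  { start: "Second question?", end: "END OF FORM" }
];

// Hypothetical helper: returns one extracted string per marker pair,
// or an empty string where the start marker is missing.
function grabSections(str, markers) {
  return markers.map(function (m) {
    var startPos = str.indexOf(m.start);
    if (startPos === -1) {
      return "";
    }
    var startOffset = startPos + m.start.length;
    var endOffset = str.indexOf(m.end, startOffset);
    if (endOffset === -1) {
      endOffset = str.length;
    }
    return str.substring(startOffset, endOffset).trim();
  });
}

// Made-up Doc text containing both sections.
var docText = "Intro (complete this section)? First answer. EVERYTHING BELOW " +
  "Second question? Second answer. END OF FORM";
var sections = grabSections(docText, SECTION_MARKERS);
// sections is ["First answer.", "Second answer."]
```

Capturing a further part of the document then only needs another entry in the list, and the resulting array can be added to the spreadsheet row as extra columns.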

Download

Bulk extract text from Google Docs for analysis download (please use 'Overview' > 'Make a copy' for your own version).

12 comments:

Unknown, 4 November 2020 at 12:39
This is very useful. When I run this code, it overwrites the header rows and my search texts.

Menkashoo, 22 November 2020 at 04:54
Hi, great work, honestly.
One thing though, this is still erasing the first row for me.

I was able to collect also the folderName.

I was able to include the startOffset tag, because it's only 2 characters, by using:
section1 = str.substring( startOffset1-2, endOffset1);

But I'm sure there is a better way to include the startOffset

And also, how would I go about getting a column with a file URL that exports the file as a PDF (basically the same fileURL, but ending with /export?format=pdf)?

Thanks

Menkashoo, 22 November 2020 at 05:01
And how would I go about reorganizing the columns in a different order?

Orrdan, 16 June 2021 at 14:19
Thank you for this! Absolutely wonderful!

Unknown, 12 August 2021 at 03:12
How do I get all the text from each file?

Haris, 9 October 2024 at 12:50
Hi, first, this was very helpful. Second, is there a way to get the text to appear in separate columns according to how many lines it has? For example:
"I am
haris
hello"
to appear as
I am | haris | hello
And is there a way to get multiple pieces of data from different parts of the document?
Thanking you in advance.
