Tuesday, 5 May 2020

Bulk extract text from Google Docs for analysis

The following Google Apps Script is designed to go through a number of Google Docs and extract specific text from the body that represents an answer to a question. The tool was developed as a result of a Researcher needing to analyse hundreds of files that were each an answer to various survey questions. There were two specific sections that needed to be targeted and the content collated into a spreadsheet so further analysis could be performed.
Screenshot of Doc Analysis results in spreadsheet
Screenshot of Doc Analysis results in spreadsheet

The results from each Google Doc is added as a row into a Google Sheet, along with a link to the relevant file and its name. As the script loops through each file it calls a Function called infoGrabber that performs the task of extracting only the relevant text.

Once we get the Doc body as a single string we specify the piece of unique text that appears just before the answer to the question that we want:
var str = body.getText().toString();
var lookFor1 = "complete this section)?";
Next we create a start-value that looks for the position of the above text within the Doc body and then adds its length so it will start at the next character along - which will be the beginning of our answer text that we want to extract:
var startOffset1 = str.indexOf(lookFor1) + lookFor1.length;
We also need a finish-value which will be the start of the next chunk of unwanted text in the Doc (the next question in this example):
var endOffset1 = str.indexOf("EVERYTHING BELOW");
Now we can get all of the text inbetween the above two values and capture this within our spreadsheet:
section1 = str.substring( startOffset1, endOffset1);
This is repeated for the second question of text we want to capture.

Bulk extract text from Google Docs for analysis

6 comments:

  1. This is very useful, when i run this code, it is overwriting the header rows and over wirting the my search texts

    ReplyDelete
    Replies
    1. Ah yes, thank you for spotting this. I have now updated the code the leave Row 1 alone.

      Kind regards
      Phil

      Delete
  2. Hi, great work, honestly.
    One thing though, this is still erasing the first row for me.

    I was able to collect also the folderName.

    I was able to include the startOffset tag because it's only 2 charaters by using:
    section1 = str.substring( startOffset1-2, endOffset1);

    But I'm sure there is a better way to include the startOffset

    And also, how would I go about getting a column with the file url that export the file into a PDF (same fileURL basically but ending ending with /export?format=pdf)?

    Thanks

    ReplyDelete
    Replies
    1. Hi Menkashoo

      I assume you've taken a new copy of the file since my post on 5 November where I've fixed the first row erasing problem?

      To get that PDF Url one way would be via a bit of concatenation so you add '/export?format=pdf' to the Google Doc Url (Line 23).

      Kind regards
      Phil

      Delete
  3. And how would I go about reorganizing the columns in a different order?

    ReplyDelete
    Replies
    1. Hi Menkashoo

      The columns match the format of the data in the array - so you would need to be comfortable understanding how arrays work in order to try and manipulate this (as it would need changing in a few places in the code).

      A simple alternative may be to just use an ImportRange query in another Google Sheet to pull the data through and arrange as you wish: https://support.google.com/docs/answer/3093340?hl=en-GB

      Kind regards
      Phil

      Delete