Screenshot of Doc Analysis results in spreadsheet |
The results from each Google Doc is added as a row into a Google Sheet, along with a link to the relevant file and its name. As the script loops through each file it calls a Function called infoGrabber that performs the task of extracting only the relevant text.
Once we get the Doc body as a single string we specify the piece of unique text that appears just before the answer to the question that we want:
var str = body.getText().toString();Next we create a start-value that looks for the position of the above text within the Doc body and then adds its length so it will start at the next character along - which will be the beginning of our answer text that we want to extract:
var lookFor1 = "complete this section)?";
var startOffset1 = str.indexOf(lookFor1) + lookFor1.length;We also need a finish-value which will be the start of the next chunk of unwanted text in the Doc (the next question in this example):
var endOffset1 = str.indexOf("EVERYTHING BELOW");Now we can get all of the text inbetween the above two values and capture this within our spreadsheet:
section1 = str.substring( startOffset1, endOffset1);
This is repeated for the second question of text we want to capture.
Download
Bulk extract text from Google Docs for analysis download (please use 'Overview' > 'Make a copy' for your own version).
This is very useful, when i run this code, it is overwriting the header rows and over wirting the my search texts
ReplyDeleteAh yes, thank you for spotting this. I have now updated the code the leave Row 1 alone.
DeleteKind regards
Phil
Hi, great work, honestly.
ReplyDeleteOne thing though, this is still erasing the first row for me.
I was able to collect also the folderName.
I was able to include the startOffset tag because it's only 2 charaters by using:
section1 = str.substring( startOffset1-2, endOffset1);
But I'm sure there is a better way to include the startOffset
And also, how would I go about getting a column with the file url that export the file into a PDF (same fileURL basically but ending ending with /export?format=pdf)?
Thanks
Hi Menkashoo
DeleteI assume you've taken a new copy of the file since my post on 5 November where I've fixed the first row erasing problem?
To get that PDF Url one way would be via a bit of concatenation so you add '/export?format=pdf' to the Google Doc Url (Line 23).
Kind regards
Phil
And how would I go about reorganizing the columns in a different order?
ReplyDeleteHi Menkashoo
DeleteThe columns match the format of the data in the array - so you would need to be comfortable understanding how arrays work in order to try and manipulate this (as it would need changing in a few places in the code).
A simple alternative may be to just use an ImportRange query in another Google Sheet to pull the data through and arrange as you wish: https://support.google.com/docs/answer/3093340?hl=en-GB
Kind regards
Phil
Thank you for this! Absolutely wonderful!
ReplyDeleteThank you Orrdan!
DeleteKind regards
Phil
how do i get all the text from each file?
ReplyDeleteHi
DeleteYou might find this blog post more useful: https://www.pbainbridge.co.uk/2021/07/extract-text-from-multiple-google-docs.html
Kind regards
Phil