Tuesday, 13 July 2021

Extract text from multiple Google Docs into a Sheet

The following Google Apps Script is designed to iterate through Google Docs in a given Google Drive folder and extract the paragraphs of text along with a link to each file into a Google Sheet. A new column will automatically be appended for each paragraph.

This tool was designed with the vision that you may wish to centrally collate some comments/feedback written by others in Docs, into one central location so you do not need to open each file in turn.

Provide a Google Folder ID to extract text from Docs
Provide a Google Folder ID to extract text from Docs

Instructions

  1. Enter the ID of the Google Drive folder that contains all of the Google Docs into the box on the Welcome sheet.
  2. Click the Start button (or use the option from the Admin menu at the top) to run the tool.
  3. The Data sheet will then populate with file information, including the paragraphs of text. You can click the links to be taken directly to that Google Doc.
  4. The 'Log' sheet is designed to capture any error messages but these will also be displayed as a popup onscreen to help you troubleshoot any problems.

 

The Code

A file iterator is the method used to loop through each Google Doc in the Drive folder. Using this method we can 'getName()' and 'getUrl()' before then opening the file using the 'DocumentApp'.

From this point we can 'openByUrl()' then 'getBody()' which allows us to then 'getParagraphs()'. From here we need a loop as there may be multiple paragraphs from which we want to 'getText()'.

In the loop we check that the text is not empty (ie a blank paragraph) as we are not interested in adding blank columns to the Google Sheet. We also have a little counter (j in this instance) which moves along 1 column for every 1 paragraph of text added.


Download

Extract text from multiple Google Docs into a Sheet download here (please use 'File' > 'Make a copy' for your own version).


/*
Function to loop through each Google Doc in a single Drive folder, get all Paragraphs,
then get the Text and append to the 'Data' sheet. Adds a new Column for each Paragraph
if there is more than 1.
*/
function getParagraphs() {
// log start of script in sheet
logEvent('Starting script');
// set error variable flag to true as no problems
var noErrors = true;
// get spreadsheet
var ss = SpreadsheetApp.getActiveSpreadsheet();
// display Toast notification to inform user
ss.toast('Running ...', 'Script starting');
// get Welcome sheet
var welcomeSheet = ss.getSheetByName('Welcome');
// get Data sheet
var dataSheet = ss.getSheetByName('Data');
// get Google Drive Folder containing Files
try {
var folderID = welcomeSheet.getRange(5, 2).getValue();
var folder = DriveApp.getFolderById(folderID);
// get all Files in Google Drive Folder
var files = folder.getFiles();
}
catch (err) {
// log error in sheet
logEvent('Problem getting Google Drive Folder: ' + err);
// display user popup to inform of error
var popupTitle = 'Google Drive Folder error';
var popupMessage = 'Problem getting Google Drive Folder: ' + err;
htmlPopup(popupTitle, popupMessage);
// set error variable flag to false so rest of script stops
var noErrors = false;
}
// check there have been no errors before proceeding
if (noErrors) {
// loop through each File in turn ********************************
while (files.hasNext()) {
try {
// get last row of Data sheet to add information to
var dataLastRow = dataSheet.getLastRow();
// add 1 to last row to account for Header row when appending information
var dataRow = dataLastRow + 1;
// get the File
var file = files.next();
// get File name
var fileName = file.getName();
// get File Url
var fileUrl = file.getUrl();
// create hyperlink for adding to sheet later
var hyperlink = '=HYPERLINK("' + fileUrl + '","' + fileName + '")';
// append File information to Data sheet
dataSheet.getRange(dataRow, 1).setValue(fileName);
dataSheet.getRange(dataRow, 2).setValue(hyperlink);
// open the File and get all Paragraphs
var doc = DocumentApp.openByUrl(fileUrl);
var body = doc.getBody();
var paragraphs = body.getParagraphs();
// set the Column value which starts at 3
var column = 3;
// loop through each Paragraph and add to Data sheet as a new Column ************
for (var i = 0; i < paragraphs.length; i++) {
// get the text
var text = paragraphs[i].getText();
// check there is text otherwise ignore blank/empty Paragraphs
if (text == "") {
// no text, do not increase 'column' as do not want blank Column in sheet
}
else {
// there is text, add to sheet and increment 'column' to move along 1 Column
var paragraphCell = dataSheet.getRange(dataRow, column).setValue(text);
column++;
// set Column to wrap text to fit content better
paragraphCell.setWrap(true);
}
}
// loop through each Paragraph and add to Data sheet as a new Column ************
}
catch (err) {
// log error in sheet
logEvent('Problem iterating through File on Row ' + dataRow + ': ' + err);
// display user popup to inform of error
var popupTitle = 'File error on Row ' + dataRow;
var popupMessage = 'Problem with File: ' + err;
htmlPopup(popupTitle, popupMessage);
// set error variable to false so rest of script stops
var noErrors = false;
// break out of loop to stop script
break;
}
}
// loop through each File in turn ********************************
}
else {
// error getting Google Drive Folder containing Files
}
// check status of error variable flag to determine final user popup
if (noErrors) {
// log message in sheet
logEvent('Script completed without errors.');
// no errors have occurred in the script
var popupTitle = 'Script Complete';
var popupMessage = 'There were no errors.';
htmlPopup(popupTitle, popupMessage);
}
else {
// do nothing as the above 'catches' already display a user popup
// log message in sheet
logEvent('Script completed with errors.');
}
}
/*
Function to output information to the 'Log' sheet.
*/
function logEvent(action) {
// get the user running the script
var theUser = Session.getActiveUser().getEmail();
// get the relevant spreadsheet to output log details
var ss = SpreadsheetApp.getActiveSpreadsheet();
var logSheet = ss.getSheetByName('Log');
// create and format a timestamp
var dateTime = new Date();
var timeZone = ss.getSpreadsheetTimeZone();
var niceDateTime = Utilities.formatDate(dateTime, timeZone, "dd/MM/yy @ HH:mm:ss");
// create array of data for pasting into log sheet
var logData = [niceDateTime, theUser, action];
// append details into next row of log sheet
logSheet.appendRow(logData);
}
/*
Display a modal dialog box with custom HtmlService content.
Does not suspend the script.
*/
function htmlPopup(popupTitle, popupMessage) {
var htmlOutput = HtmlService
.createHtmlOutput(popupMessage)
.setWidth(360)
.setHeight(180);
SpreadsheetApp.getUi().showModalDialog(htmlOutput, popupTitle);
}
/*
Create Menu item for running script
*/
function onOpen() {
// add a custom menu to the spreadsheet
SpreadsheetApp.getUi()
.createMenu('Admin')
.addItem('Get Files', 'getFiles') // label for menu item, name of function to run.
.addToUi();
}

7 comments:

  1. Hi Scott

    Thank you!

    This is only designed to get the Document 'body' not other areas of the Google Doc. You may be able to do this with 'getHeader()' - https://developers.google.com/apps-script/reference/document/document?hl=en#getheader

    Kind regards
    Phil

    ReplyDelete
  2. Hi Phil! I tried using this but i keep getting this error: "Problem with File: Exception: The document is inaccessible. Please try again later." I've checked the permissions on the file and its set to "Anyone with link can edit" so Im not sure what else to try.

    ReplyDelete
    Replies
    1. Hi Jeff

      Hhhmmm, you're definitely entering the ID of the Drive 'folder' into the Google Sheet right?
      It's a Google Doc and not a Word Doc?
      Have you tried creating a fresh Google Doc in a folder with a simple bit of text in and trying that?

      Ta
      Phil

      Delete
  3. Hi Phil,

    I'm using documents with dropdown menu's but theyre not being recognised as text so its just skipping them, anything i can do to fix this?

    Thanks,

    ReplyDelete
    Replies
    1. Hi James

      As in Smart Chips? To my knowledge Google are still yet to release compatibility with Smart Chips and Apps Script - as it's affecting a number of other systems I would like to develop further.

      Ta

      Delete
    2. Dropdowns dont come under smart chips in the menu but yeah perhaps they're under the same limitations.

      Delete
    3. Rightio. Sorry I've not done a lot more with extracting Google Doc text as I find it very cumbersome and try to work with Spreadsheets where possible.

      Delete