This link has the steps to make Google Apps Script run OCR on a PDF:
const blob = DriveApp.getFileById(fileID).getBlob();
const resource = {
  title: blob.getName(),
  mimeType: blob.getContentType()
};
const options = {
  ocr: true,
  ocrLanguage: "en"
};
// Convert the pdf to a Google Doc with ocr.
const file = Drive.Files.insert(resource, blob, options);
But this generally gave pretty terrible results for me, and the formatting was completely lost.
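To judge the OCR output yourself, a minimal sketch like the one below wraps the conversion in a function and reads the recognized text back from the new document. The function name `ocrPdfToText` and its `language` parameter are my own for illustration, and it assumes the Drive advanced service (API v2, which is where `Drive.Files.insert` lives) is enabled for the script project.

```javascript
/**
 * Converts a PDF stored in Drive to a Google Doc with OCR and returns the
 * recognized plain text. Hypothetical helper; assumes the Drive advanced
 * service (v2) is enabled in the Apps Script project.
 */
function ocrPdfToText(fileId, language) {
  const blob = DriveApp.getFileById(fileId).getBlob();
  const resource = {
    title: blob.getName(),
    mimeType: blob.getContentType()
  };
  const options = {
    ocr: true,
    ocrLanguage: language || "en" // default to English, as in the snippet above
  };
  // Each call creates a new Google Doc containing the OCR result.
  const docFile = Drive.Files.insert(resource, blob, options);
  // Read the text back; this string carries no layout information.
  return DocumentApp.openById(docFile.id).getBody().getText();
}
```

Logging the returned string (for example with `Logger.log(ocrPdfToText(fileID))`) shows how accurate the recognition is, but not how much of the layout survived, so it is still worth opening the generated Doc.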
For my use case, I found that the PDF was being generated from HTML, so converting the HTML directly to a Google Doc gave good results:
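assethtml += contentdata;
// Wrap the HTML string in a blob so it can be uploaded to Drive.
var ablob = Utilities.newBlob(assethtml, MimeType.HTML, "asset.html");
// Insert it with a Google Docs target type so it is saved as a GDoc.
var AssetGDocId = Drive.Files.insert(
  {
    title: 'The name of the document',
    mimeType: MimeType.GOOGLE_DOCS,
    parents: [{"id": destFolderID}]
  },
  ablob
).id;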
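The same steps can be packaged as a reusable function. This is only a sketch under the same assumption as above (Drive advanced service, API v2, enabled); the name `htmlToGoogleDoc` and its parameters are hypothetical, not part of any library.

```javascript
/**
 * Creates a Google Doc from an HTML string and returns the new document's ID.
 * Hypothetical helper; assumes the Drive advanced service (v2) is enabled.
 */
function htmlToGoogleDoc(html, title, destFolderId) {
  // Wrap the HTML in a blob so it can be uploaded.
  const htmlBlob = Utilities.newBlob(html, MimeType.HTML, "asset.html");
  const resource = {
    title: title,
    mimeType: MimeType.GOOGLE_DOCS, // target type: Google Doc
    parents: [{ id: destFolderId }] // folder that will hold the new Doc
  };
  return Drive.Files.insert(resource, htmlBlob).id;
}
```

It could then be called as `htmlToGoogleDoc(assethtml, 'The name of the document', destFolderID)`, which keeps the HTML-building code separate from the Drive upload.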