Tachytelic.net

Extract text from Word docx files with Power Automate

May 30, 2021 by Paulie

This post explains how to extract text from Microsoft Word docx files using only built-in actions in Power Automate. Third-party actions exist, which are probably more sophisticated and can certainly make this process easier.

docx files are actually zip files

The first thing that is important to understand is that a Word docx file is actually a zip file that contains a number of folders and files. The root of the zip folder contains these files:

Image of the root folder of a word docx zip file
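
If you want to confirm this outside of Power Automate, here is a minimal Python sketch using only the standard library. It does not read a real Word file: build_minimal_docx is a hypothetical helper that writes a few placeholder entries into an in-memory archive (not a valid Word document), purely to show that the container is an ordinary zip whose entries can be listed.

```python
import io
import zipfile


def build_minimal_docx() -> bytes:
    """Hypothetical stand-in for a docx file: a zip with placeholder entries."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("_rels/.rels", "<Relationships/>")
        archive.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


def list_archive_entries(docx_bytes: bytes) -> list[str]:
    """Return the entry names of a docx file, treating it as an ordinary zip."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.namelist()


docx_bytes = build_minimal_docx()
is_zip = zipfile.is_zipfile(io.BytesIO(docx_bytes))  # a docx passes the zip check
entries = list_archive_entries(docx_bytes)

print(is_zip)
for name in entries:
    print(name)
```

With a real document you would pass the bytes of the docx file instead of the placeholder archive.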

The word folder in the root of the zip file contains more files and folders:

Within the word folder, there is a file called document.xml (sometimes documentN.xml) which contains the actual document content, and this is the file which we will parse with Power Automate. My example word document looks like this:

The content of document.xml contains:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14">
  <w:body>
    <w:p xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordml" w:rsidP="65D7FFCD" w14:paraId="2C078E63" wp14:textId="568EF955">
      <w:pPr>
        <w:rPr>
          <w:sz w:val="24"/>
          <w:szCs w:val="24"/>
        </w:rPr>
      </w:pPr>
      <w:bookmarkStart w:name="_GoBack" w:id="0"/>
      <w:bookmarkEnd w:id="0"/>
      <w:r w:rsidRPr="65D7FFCD" w:rsidR="648E6FBB">
        <w:rPr>
          <w:sz w:val="24"/>
          <w:szCs w:val="24"/>
        </w:rPr>
        <w:t xml:space="preserve">How to extract </w:t>
      </w:r>
      <w:r w:rsidRPr="65D7FFCD" w:rsidR="648E6FBB">
        <w:rPr>
          <w:b w:val="1"/>
          <w:bCs w:val="1"/>
          <w:sz w:val="24"/>
          <w:szCs w:val="24"/>
        </w:rPr>
        <w:t xml:space="preserve">text </w:t>
      </w:r>
      <w:r w:rsidRPr="65D7FFCD" w:rsidR="648E6FBB">
        <w:rPr>
          <w:sz w:val="24"/>
          <w:szCs w:val="24"/>
        </w:rPr>
        <w:t>from a Microsoft Word docx file</w:t>
      </w:r>
      <w:r w:rsidRPr="65D7FFCD" w:rsidR="4ECA4038">
        <w:rPr>
          <w:sz w:val="24"/>
          <w:szCs w:val="24"/>
        </w:rPr>
        <w:t>.</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="65D7FFCD" w:rsidP="65D7FFCD" w:rsidRDefault="65D7FFCD" w14:paraId="77484B81" w14:textId="049555E8">
      <w:pPr>
        <w:pStyle w:val="Normal"/>
        <w:rPr>
          <w:sz w:val="24"/>
          <w:szCs w:val="24"/>
        </w:rPr>
      </w:pPr>
    </w:p>
    <w:p w:rsidR="4ECA4038" w:rsidP="65D7FFCD" w:rsidRDefault="4ECA4038" w14:paraId="10A0E5FC" w14:textId="67D59CCA">
      <w:pPr>
        <w:pStyle w:val="Normal"/>
        <w:rPr>
          <w:sz w:val="24"/>
          <w:szCs w:val="24"/>
        </w:rPr>
      </w:pPr>
      <w:r w:rsidRPr="65D7FFCD" w:rsidR="4ECA4038">
        <w:rPr>
          <w:sz w:val="24"/>
          <w:szCs w:val="24"/>
        </w:rPr>
        <w:t>This document explains how to extract text from a Microsoft Word document using standard Power Automate actions. The result isn’t perfect, but it should be good enough for basic usage.</w:t>
      </w:r>
    </w:p>
    <w:sectPr>
      <w:pgSz w:w="12240" w:h="15840" w:orient="portrait"/>
      <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>

As you can see from the above, the text data is on lines 18, 27, 34, 41 and 66 of the XML file.

Step 1 – Extract the contents of the Word document

To be able to access the content of document.xml the docx file needs to be extracted first. Use the flow action Extract archive to folder to extract the docx file to a temporary folder. Make sure you set the overwrite option to Yes.

Note: You will not be able to select the word document from the file browser within the action because it filters the available files to show only files with a .zip extension. So you can either:

  • Rename the docx file to .zip
  • Put in the file path manually or use dynamic content from a previous step

In my flow, the action looks like this:

Image of a word document being extracted by Power Automate
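
For comparison, the extraction step can be mimicked locally with a rough Python sketch. The helper names and the sample archive are assumptions made for illustration. The returned records only imitate the Name and Id properties used later in this post, with the extracted path standing in for the Id; the real action's output may contain more than this.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def make_sample_docx() -> bytes:
    """Hypothetical stand-in for a docx file: placeholder entries only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
        archive.writestr("word/styles.xml", "<styles/>")
    return buffer.getvalue()


def extract_docx(docx_bytes: bytes, target_dir: Path) -> list[dict]:
    """Extract the archive into target_dir and describe each extracted file."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        # extractall overwrites files that already exist in the target folder
        archive.extractall(target_dir)
        return [
            {"Name": Path(info.filename).name, "Id": str(target_dir / info.filename)}
            for info in archive.infolist()
            if not info.is_dir()
        ]


# Extract into a temporary folder that is cleaned up afterwards
with tempfile.TemporaryDirectory() as temp_folder:
    records = extract_docx(make_sample_docx(), Path(temp_folder))

for record in records:
    print(record["Name"])
```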

Step 2 – Filter the output of the extraction

The output of the Extract archive to folder action is an array of objects which contains information about every file extracted from the archive. This output needs to be filtered so that we can get the file Id of document.xml. Add a Filter array action and use the output of Extract archive to folder as the input for the filter. Click the edit in advanced mode link and use this filter expression:

@and(startsWith(item()['Name'], 'document'),endsWith(item()['Name'], 'xml'))

This will filter the array and narrow it down to just the file containing the document contents.
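
Outside of the flow, the same filter logic can be sketched in Python. The records below are made-up examples of what the extraction output might contain, and the Id values are hypothetical paths. The workflow startsWith and endsWith functions are documented as not case-sensitive, whereas Python's string methods are, so the sketch lowercases the name first.

```python
def is_document_xml(record: dict) -> bool:
    """Mirror of the filter: name starts with 'document' and ends with 'xml'."""
    name = record["Name"].lower()
    return name.startswith("document") and name.endswith("xml")


# Made-up extraction output; only Name and Id are modelled here
extracted = [
    {"Name": "[Content_Types].xml", "Id": "/tmp/docx/[Content_Types].xml"},
    {"Name": "document.xml.rels", "Id": "/tmp/docx/word/_rels/document.xml.rels"},
    {"Name": "document.xml", "Id": "/tmp/docx/word/document.xml"},
    {"Name": "styles.xml", "Id": "/tmp/docx/word/styles.xml"},
]

filtered = [record for record in extracted if is_document_xml(record)]
print([record["Name"] for record in filtered])
```

Both halves of the condition matter: document.xml.rels starts with "document" but does not end with "xml", and styles.xml ends with "xml" but does not start with "document".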

Step 3 – Get the file content of document.xml

Add a Get file content action and use this expression for the file:

first(body('Filter_array'))['Id']

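
As a rough Python analogue of this step, the sketch below takes the first record from the filtered list and reads the file it points to. The function name and the temporary sample file are assumptions for illustration. Unlike the expression above, it checks explicitly for an empty list, which would happen if no document.xml was found.

```python
import tempfile
from pathlib import Path


def read_first_match(filtered: list[dict]) -> bytes:
    """Read the file referenced by the first filtered record."""
    if not filtered:
        raise ValueError("no document.xml entry was found in the archive")
    return Path(filtered[0]["Id"]).read_bytes()


# Write a placeholder document.xml so that the sketch has something to read
with tempfile.TemporaryDirectory() as temp_folder:
    sample_path = Path(temp_folder) / "document.xml"
    sample_path.write_bytes(b"<document/>")
    content = read_first_match([{"Name": "document.xml", "Id": str(sample_path)}])

print(content)
```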

Step 4 – Grab the content of the text elements

Finally, add a Compose action and use the following expression:

xpath(xml(outputs('Get_file_content')?['body']), '//*[name()=''w:t'']/text()')

The xpath expression will grab each element named w:t and return an array of strings of the content found in those elements. The output from my sample document produced the following array:

[
  "How to extract ",
  "text ",
  "from a Microsoft Word docx file",
  ".",
  "This document explains how to extract text from a Microsoft Word document using standard Power Automate actions. The result isn’t perfect, but it should be good enough for basic usage."
]
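
The same extraction can be reproduced with Python's standard library as a cross-check. This is only a sketch: the XML is a cut-down version of the sample document above, and it matches elements by their namespace URI, whereas the flow expression matches on the prefixed name w:t.

```python
import xml.etree.ElementTree as ET

# Namespace URI bound to the w: prefix in document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Cut-down version of the sample document, formatting properties mostly removed
SAMPLE_DOCUMENT_XML = """<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r><w:t xml:space="preserve">How to extract </w:t></w:r>
      <w:r><w:rPr><w:b w:val="1"/></w:rPr><w:t xml:space="preserve">text </w:t></w:r>
      <w:r><w:t>from a Microsoft Word docx file</w:t></w:r>
      <w:r><w:t>.</w:t></w:r>
    </w:p>
    <w:p/>
    <w:p>
      <w:r><w:t>This document explains how to extract text.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def extract_text_runs(document_xml: bytes) -> list[str]:
    """Return the text of every w:t element, in document order."""
    root = ET.fromstring(document_xml)
    return [element.text or "" for element in root.iter(f"{{{W_NS}}}t")]


text_runs = extract_text_runs(SAMPLE_DOCUMENT_XML.encode("utf-8"))
print(text_runs)
```

Each w:t element becomes one string, so a sentence that Word has split into several runs comes back in several pieces.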

At this point you can either iterate through the results, or use a simple join expression to create a single string from the results. Here is a screenshot of the entire flow:

Image of a Power Automate Flow that extracts the text content from a Word docx file
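
To illustrate the join option, here is a small Python sketch using the array shown earlier. It also shows one reason the result isn't perfect: the array carries no paragraph boundaries, so joining with an empty delimiter runs the first paragraph straight into the next. Which delimiter suits your document is a judgment call.

```python
text_runs = [
    "How to extract ",
    "text ",
    "from a Microsoft Word docx file",
    ".",
    "This document explains how to extract text from a Microsoft Word document "
    "using standard Power Automate actions. The result isn’t perfect, but it "
    "should be good enough for basic usage.",
]

# Joining with an empty delimiter keeps the original spacing inside a paragraph,
# but nothing separates the end of one paragraph from the start of the next
single_string = "".join(text_runs)

# Iterating instead lets each run be handled individually
run_lengths = [len(run) for run in text_runs]

print(single_string)
print(run_lengths)
```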

As you can see from the above, it is possible to extract text from a Word docx file with Power Automate quite easily, and a more sophisticated xpath expression could target the specific regions of text you require.
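
As a sketch of what more targeted queries might look like, the Python below groups text by paragraph and picks out bold runs from a cut-down version of the sample document. It uses the limited XPath support in Python's standard library rather than the flow's xpath function, and the function names are illustrative. It treats any w:b element as bold and ignores its w:val attribute, which is a simplification.

```python
import xml.etree.ElementTree as ET

NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

# Cut-down version of the sample document used in this post
SAMPLE_DOCUMENT_XML = """<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r><w:t xml:space="preserve">How to extract </w:t></w:r>
      <w:r><w:rPr><w:b w:val="1"/></w:rPr><w:t xml:space="preserve">text </w:t></w:r>
      <w:r><w:t>from a Microsoft Word docx file</w:t></w:r>
      <w:r><w:t>.</w:t></w:r>
    </w:p>
    <w:p/>
    <w:p>
      <w:r><w:t>This document explains how to extract text.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def paragraph_texts(root: ET.Element) -> list[str]:
    """Join the w:t text inside each w:p so that one string is one paragraph."""
    return [
        "".join(t.text or "" for t in paragraph.findall(".//w:t", NS))
        for paragraph in root.findall(".//w:p", NS)
    ]


def bold_texts(root: ET.Element) -> list[str]:
    """Return the text of runs whose run properties contain a w:b element."""
    result = []
    for run in root.findall(".//w:r", NS):
        if run.find("w:rPr/w:b", NS) is not None:
            result.extend(t.text or "" for t in run.findall("w:t", NS))
    return result


root = ET.fromstring(SAMPLE_DOCUMENT_XML.encode("utf-8"))
paragraphs = paragraph_texts(root)
bold = bold_texts(root)

print(paragraphs)
print(bold)
```

The empty paragraph in the sample comes back as an empty string, which can be filtered out if blank lines are not wanted.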

Comments

  1. erfan says

    September 15, 2021 at 7:54 pm

    Is the process similar if I would like to read the content and some particular textfield from an HTML file?

  2. ARUN says

    January 17, 2022 at 12:05 pm

    Hi Paulie,

    could you please provide xpath lines to extract paragraph properties, paragraph Id, Text Font styles?

  3. Rune Holm says

    February 4, 2022 at 10:48 pm

    G R E A T! Works like a charm 😉

  4. Bradley Brooks says

    February 9, 2022 at 5:34 pm

    Instead of extracting text, how can I replace text (specifically a place holder in a template doc) and then re-zip it?

  5. Micah says

    May 3, 2022 at 3:36 pm

    @Bradley Brooks – I’m not certain on how to re-zip, but I’ve used the “replace” expression a few times and it has worked well.
    Below is what I used to remove the [] and " from the extracted text by using the “Compose” action. It would be added after the docxText action.

    replace(replace(replace(string(outputs('docxText')), '[', ''), ']', ''), '"', '')
