This post explains how to extract text from Microsoft Word docx files using only built in actions in Power Automate. 3rd party actions exist, which are more probably more sophisticated and can certainly make this process easier.
docx files are actually zip files
The first thing to that is important to understand, is that a word docx file is actually a zip file that contains a number of folders and files. The root of the zip folder contains these files:

The word folder in the root of the zip file contains more files and folders:

Within the word folder, there is a file called document.xml (sometimes documentN.xml) which contains the actual document content, and this is the file which we will parse with Power Automate. My example word document looks like this:

The content of document.xml contains:
<?xml version="1.0" encoding="utf-8" standalone="yes"?> <w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14"> <w:body> <w:p xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordml" w:rsidP="65D7FFCD" w14:paraId="2C078E63" wp14:textId="568EF955"> <w:pPr> <w:rPr> <w:sz w:val="24"/> <w:szCs w:val="24"/> </w:rPr> </w:pPr> <w:bookmarkStart w:name="_GoBack" w:id="0"/> <w:bookmarkEnd w:id="0"/> <w:r w:rsidRPr="65D7FFCD" w:rsidR="648E6FBB"> <w:rPr> <w:sz w:val="24"/> <w:szCs w:val="24"/> </w:rPr> <w:t xml:space="preserve">How to extract </w:t> </w:r> <w:r w:rsidRPr="65D7FFCD" w:rsidR="648E6FBB"> <w:rPr> <w:b w:val="1"/> <w:bCs w:val="1"/> <w:sz w:val="24"/> <w:szCs w:val="24"/> </w:rPr> <w:t xml:space="preserve">text </w:t> </w:r> <w:r w:rsidRPr="65D7FFCD" w:rsidR="648E6FBB"> <w:rPr> <w:sz w:val="24"/> <w:szCs w:val="24"/> </w:rPr> <w:t>from a Microsoft Word docx file</w:t> </w:r> <w:r w:rsidRPr="65D7FFCD" w:rsidR="4ECA4038"> <w:rPr> <w:sz w:val="24"/> <w:szCs w:val="24"/> </w:rPr> <w:t>.</w:t> </w:r> </w:p> <w:p w:rsidR="65D7FFCD" w:rsidP="65D7FFCD" w:rsidRDefault="65D7FFCD" w14:paraId="77484B81" w14:textId="049555E8"> <w:pPr> <w:pStyle w:val="Normal"/> <w:rPr> <w:sz w:val="24"/> <w:szCs w:val="24"/> </w:rPr> </w:pPr> </w:p> <w:p w:rsidR="4ECA4038" w:rsidP="65D7FFCD" w:rsidRDefault="4ECA4038" w14:paraId="10A0E5FC" w14:textId="67D59CCA"> <w:pPr> <w:pStyle w:val="Normal"/> <w:rPr> <w:sz w:val="24"/> <w:szCs w:val="24"/> </w:rPr> </w:pPr> <w:r w:rsidRPr="65D7FFCD" w:rsidR="4ECA4038"> <w:rPr> <w:sz w:val="24"/> <w:szCs w:val="24"/> </w:rPr> <w:t>This document explains how to extract text from a Microsoft Word document using standard Power Automate actions. The result isn’t perfect, but it should be good enough for basic usage.</w:t> </w:r> </w:p> <w:sectPr> <w:pgSz w:w="12240" w:h="15840" w:orient="portrait"/> <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/> <w:cols w:space="720"/> <w:docGrid w:linePitch="360"/> </w:sectPr> </w:body> </w:document>
As you can see from the above, the text data is on lines 18,27,34, 41 and 66 of the XML file.
Step 1 – Extract the contents of the Word document
To be able to access the content of document.xml the docx file needs to be extracted first. Use the flow action Extract archive to folder to extract the docx file to a temporary folder. Make sure you set the overwrite option to Yes.
Note: You will not be able to select the word document from the file browser within the action because it filters the available files to show only files with a .zip extension. So you can either:
- Rename the docx file to .zip
- Put in the file path manually or use dynamic content from a previous step
In my flow, the action looks like this:

Step 2 – Filter the output of the extraction
The output of the Extract archive to folder action is an array of objects which contains information about every file extracted from the archive. This output needs to be filtered so that we can get the file Id of the document.xml. So add a filter array action and use the output of Extract archive to folder as the input for the filter. Click the edit in advanced mode link and use this filter expression:
@and(startsWith(item()['Name'], 'document'),endsWith(item()['Name'], 'xml'))
This will filter the array and narrow it down to just the file containing the document contents. Here is how my filter array looks:

Step 3 – Get the file content of document.xml
Add a Get file content action and use this expression for the file:
first(body('Filter_array'))['Id']
It should look like this:

Step 4 – Grab the content of the text elements
Finally, add a compose action and use the following expresison:
xpath(xml(outputs('Get_file_content')?['body']), '//*[name()=''w:t'']/text()')
Here is how it looks in my flow:

The xpath expression will grab each element named w:t and return an array of strings of the content found in those elements. Click here If you’d like to learn more about the structure of a word docx file. The output from my sample document produced the following array:
[ "How to extract ", "text ", "from a Microsoft Word docx file", ".", "This document explains how to extract text from a Microsoft Word document using standard Power Automate actions. The result isn’t perfect, but it should be good enough for basic usage." ]
At this point you can either iterate through the results, or use a simple join expression to create a single string from the results. Here is a screenshot of the entire flow:

As you can see from the above, it is possible to Extract Text from a Word docx file with Power Automate quite easily, and a more sophisticated xpath expression could target specific regions of text required.
erfan says
Is the process similar if I would like to read the content and some particular textfield from an HTML file?
ARUN says
Hi Paulie,
could you please provide xpath lines to extract paragraph properties, paragraph Id, Text Font styles?
Rune Holm says
G R E A T! Works like a charm 😉
Bradley Brooks says
Instead of extracting text, how can I replace text (specifically a place holder in a template doc) and then re-zip it?
Micah says
@Bradley Brooks – I’m not certain on how to re-zip, but I’ve used the “replace” expression a few times and it has worked well.
Below is what I used to remove the [] and ” from the extracted text by using the “Compose” action. It would be added after the docxText action.
replace(replace(replace(string(outputs(‘docxText’)), ‘[‘, ”), ‘]’, ”), ‘”‘, ”)