Extract: Word Document Content Extraction
Powerful extraction tools to analyze and extract text, tables, metadata, styles, and content from Word documents. Perfect for data analysis, content migration, and automation workflows.
Overview
Understanding Word document extraction capabilities
What is Word Document Extraction?
Extraction allows you to analyze and extract structured data from Word documents. DOCX Studio provides two powerful interfaces:
REST API
For developers to programmatically extract data
- Automated extraction workflows
- Multiple language support
- System integration ready
- Batch processing capability
DOCX Studio
Visual interface for interactive extraction
- Drag-and-drop file upload
- Interactive data preview
- One-click downloads
- No coding required
Key Capabilities
- 6 Extraction Modes: Flexible options from simple to detailed with content and metadata variations
- Content Filtering: Filter by paragraphs, tables, headers, footers, footnotes, and more
- Multiple Output Formats: JSON, YAML, TOML, CSV for different use cases
- Style Extraction: Extract document styles, fonts, and formatting information
- Table Extraction: Preserve table structure with merged cell handling
- Batch Processing: Extract multiple content types in one operation
- API Parity: All studio features available via REST API with 6 language examples
Extraction Features
Text Extraction
Full document text with formatting preservation
- Paragraphs with style information
- Heading hierarchy detection
- Font, size, and color preservation
- Bold, italic, underline tracking
- Structured output by section
Table Extraction
Preserve structure and formatting of tables
- Cell data and formatting
- Merged cells handling
- Column width preservation
- Row and column spans
- CSV and Excel export
Style Extraction
Capture document styles and themes
- Named styles and custom styles
- Font families and sizes
- Color schemes and themes
- Paragraph spacing and alignment
- List and numbering formats
Metadata Extraction
Document properties and statistics
- Author and creation date
- Word count and page count
- Document revision history
- Custom document properties
- File size and format details
Content Analysis
Deep analysis of document structure
- Section and heading structure
- Hyperlink extraction
- Cross-reference mapping
- Table of contents analysis
- Footnote and endnote extraction
Batch Processing
Process multiple documents efficiently
- Multi-file upload support
- Consistent output structure
- Parallel processing capability
- Combined result downloads
- Progress tracking per file
Extraction Modes
Choose how much data to extract based on your needs
Mode 1: Detailed
Full extraction with complete document properties, including all paragraphs, tables, styles, metadata, and formatting information.
Best for: Complete content analysis, document migration, full reconstruction
Mode 2: Simple
Basic extraction with essential information only - just the core text and tables without extended properties.
Best for: Quick analysis, reducing file size, core content only
Mode 3: Metadata
Extract only document metadata and properties - author, dates, statistics, and custom properties.
Best for: Document cataloging, compliance checks, audit trails
Mode 4: Content
Extract primary content - paragraphs, headings, and text elements without deep formatting details.
Best for: Content migration, text analysis, search indexing
Mode 5: Tables Only
Extract only table data from the document, preserving structure and cell values.
Best for: Data extraction, spreadsheet conversion, tabular analysis
Mode 6: All
Complete extraction of everything - text, tables, metadata, styles, images, headers, footers, and all document elements.
Best for: Full archival, complete document reconstruction, comprehensive analysis
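Each mode corresponds to a value of the `mode` form field used in the API examples (`detailed`, `simple`, `metadata`, `content`, `tables`, `all`), and each output format to a value of `file_format`. The following is a minimal sketch of checking both values on the client before sending a request. The helper name `build_extract_fields` is hypothetical and not part of the API, and the server may validate these fields differently.

```python
# Allowed values, taken from the comments in this guide's API examples.
VALID_MODES = ("detailed", "simple", "metadata", "content", "tables", "all")
VALID_FORMATS = ("json", "yaml", "toml", "csv")


def build_extract_fields(mode: str = "detailed", file_format: str = "json") -> dict:
    """Return the form fields for an extract request, rejecting unknown values early.

    Hypothetical helper for illustration; not part of the documented API.
    """
    mode = mode.strip().lower()
    file_format = file_format.strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {VALID_MODES}")
    if file_format not in VALID_FORMATS:
        raise ValueError(
            f"unknown file_format {file_format!r}; expected one of {VALID_FORMATS}"
        )
    return {"mode": mode, "file_format": file_format}
```

The returned dictionary can be passed as the `data` argument of a multipart POST, alongside the uploaded file.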
API Usage Examples
Learn how to extract content from Word documents using the API in your preferred programming language
Extract Document Examples
cURL
# Extract with detailed mode, organized by content type
curl -X POST https://powerfile.io/docx/api/extract \
  -H "Authorization: YOUR_API_TOKEN" \
  -F "file=@document.docx" \
  -F "mode=detailed" \
  -F "file_format=json"
# Output structure:
# extracted_data/
# ├── content/
# │   ├── paragraphs.json
# │   ├── headings.json
# │   └── full_text.txt
# ├── tables/
# │   ├── table_metadata.json
# │   ├── table_1.csv
# │   └── table_2.csv
# ├── metadata/
# │   ├── document_properties.json
# │   └── statistics.json
# └── styles/
#     ├── styles.json
#     └── fonts.json
Python
import requests
# Configure API endpoint and authentication
url = "https://powerfile.io/docx/api/extract"
headers = {"Authorization": "YOUR_API_TOKEN"}
data = {
    "mode": "detailed",     # detailed, simple, metadata, content, tables, all
    "file_format": "json",  # json, yaml, toml, csv
}
# Open and upload the Word document; the request is sent while the file is still open
with open("document.docx", "rb") as f:
    files = {"file": ("document.docx", f, "application/vnd.openxmlformats-officedocument.wordprocessingml.document")}
    response = requests.post(url, headers=headers, files=files, data=data)
if response.status_code == 200:
    result = response.json()
    print("Extraction successful!")
    print(f"Output directory: {result['output_dir']}")
    print(f"Paragraphs extracted: {result['paragraphs_count']}")
    print(f"Tables extracted: {result['tables_count']}")
    print(f"Styles found: {result['styles_count']}")
else:
    print(f"Error: {response.status_code}")
    # Print the raw body, since an error response is not guaranteed to be JSON
    print(response.text)
JavaScript
const fs = require('fs');
const FormData = require('form-data');
const axios = require('axios');
// Configure API endpoint and authentication
const url = 'https://powerfile.io/docx/api/extract';
const apiToken = 'YOUR_API_TOKEN';
// Create form data with file and parameters
const form = new FormData();
form.append('file', fs.createReadStream('document.docx'));
form.append('mode', 'detailed');    // detailed, simple, metadata, content, tables, all
form.append('file_format', 'json'); // json, yaml, toml, csv
// Make the API request
axios.post(url, form, {
  headers: {
    ...form.getHeaders(),
    'Authorization': apiToken
  }
}).then(response => {
  console.log('Extraction successful!');
  console.log('Output directory:', response.data.output_dir);
  console.log('Paragraphs:', response.data.paragraphs_count);
  console.log('Tables:', response.data.tables_count);
  console.log('Styles:', response.data.styles_count);
}).catch(error => {
  // error.response is only set when the server replied with a non-2xx status
  console.error('Error:', error.response?.data || error.message);
});
Java
import java.io.File;
import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.mime.MultipartEntityBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
// The wrapper class name is arbitrary; it only makes the snippet compilable
public class ExtractExample {
    public static void main(String[] args) {
        // Configure API endpoint and authentication
        HttpPost uploadFile = new HttpPost("https://powerfile.io/docx/api/extract");
        uploadFile.setHeader("Authorization", "YOUR_API_TOKEN");
        // Build multipart form data
        MultipartEntityBuilder builder = MultipartEntityBuilder.create();
        builder.addBinaryBody(
            "file",
            new File("document.docx"),
            ContentType.create("application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
            "document.docx"
        );
        builder.addTextBody("mode", "detailed");    // detailed, simple, metadata, content, tables, all
        builder.addTextBody("file_format", "json"); // json, yaml, toml, csv
        HttpEntity multipart = builder.build();
        uploadFile.setEntity(multipart);
        // Execute request; try-with-resources closes both the client and the response
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(uploadFile)) {
            String responseBody = EntityUtils.toString(response.getEntity());
            if (response.getStatusLine().getStatusCode() == 200) {
                // Parse only on success, since an error body is not guaranteed to be JSON
                JsonObject result = JsonParser.parseString(responseBody).getAsJsonObject();
                System.out.println("Extraction successful!");
                System.out.println("Output: " + result.get("output_dir").getAsString());
                System.out.println("Paragraphs: " + result.get("paragraphs_count").getAsInt());
                System.out.println("Tables: " + result.get("tables_count").getAsInt());
            } else {
                System.err.println("Error: " + response.getStatusLine());
                System.err.println(responseBody);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
C#
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text.Json;
using System.Threading.Tasks;
// Configure API endpoint and authentication
using var client = new HttpClient();
client.DefaultRequestHeaders.Add("Authorization", "YOUR_API_TOKEN");
// Create multipart form content
using var content = new MultipartFormDataContent();
using (var fileStream = File.OpenRead("document.docx"))
{
    var streamContent = new StreamContent(fileStream);
    streamContent.Headers.ContentType = new MediaTypeHeaderValue(
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    );
    content.Add(streamContent, "file", "document.docx");
    content.Add(new StringContent("detailed"), "mode");    // detailed, simple, metadata, content, tables, all
    content.Add(new StringContent("json"), "file_format"); // json, yaml, toml, csv
    // Make the API request while the file stream is still open
    var response = await client.PostAsync("https://powerfile.io/docx/api/extract", content);
    var responseBody = await response.Content.ReadAsStringAsync();
    if (response.IsSuccessStatusCode)
    {
        // Deserialize needs an explicit target type; JsonElement allows property lookups by name
        var result = JsonSerializer.Deserialize<JsonElement>(responseBody);
        Console.WriteLine("Extraction successful!");
        Console.WriteLine($"Output directory: {result.GetProperty("output_dir").GetString()}");
        Console.WriteLine($"Paragraphs: {result.GetProperty("paragraphs_count").GetInt32()}");
        Console.WriteLine($"Tables: {result.GetProperty("tables_count").GetInt32()}");
        Console.WriteLine($"Styles: {result.GetProperty("styles_count").GetInt32()}");
    }
    else
    {
        Console.WriteLine($"Error: {response.StatusCode}");
        Console.WriteLine(responseBody);
    }
}
Go
package main
import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "mime/multipart"
    "net/http"
    "os"
)
func main() {
    // Open the Word document
    file, err := os.Open("document.docx")
    if err != nil {
        panic(err)
    }
    defer file.Close()
    // Create multipart form data
    body := &bytes.Buffer{}
    writer := multipart.NewWriter(body)
    part, err := writer.CreateFormFile("file", "document.docx")
    if err != nil {
        panic(err)
    }
    if _, err := io.Copy(part, file); err != nil {
        panic(err)
    }
    // Add form fields
    writer.WriteField("mode", "detailed")    // detailed, simple, metadata, content, tables, all
    writer.WriteField("file_format", "json") // json, yaml, toml, csv
    // Close the writer so the final multipart boundary is written
    if err := writer.Close(); err != nil {
        panic(err)
    }
    // Create and configure request
    req, err := http.NewRequest("POST", "https://powerfile.io/docx/api/extract", body)
    if err != nil {
        panic(err)
    }
    req.Header.Set("Authorization", "YOUR_API_TOKEN")
    req.Header.Set("Content-Type", writer.FormDataContentType())
    // Execute request
    client := &http.Client{}
    resp, err := client.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    if resp.StatusCode == 200 {
        // Parse the response only on success
        var result map[string]interface{}
        if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
            panic(err)
        }
        fmt.Println("Extraction successful!")
        fmt.Println("Output:", result["output_dir"])
        fmt.Println("Paragraphs:", result["paragraphs_count"])
        fmt.Println("Tables:", result["tables_count"])
        fmt.Println("Styles:", result["styles_count"])
    } else {
        fmt.Printf("Error: %d\n", resp.StatusCode)
    }
}
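The examples above read `output_dir`, `paragraphs_count`, `tables_count`, and `styles_count` directly from the response. Whether every mode returns all four fields is not stated here, so the sketch below treats them as optional. The helper name `summarize_result` is hypothetical.

```python
def summarize_result(result: dict) -> str:
    """Build a one-line summary from the JSON body of a successful extract call.

    Uses .get() so that a missing field (an assumption: some modes may omit
    counts) is reported as "n/a" instead of raising KeyError.
    """
    fields = [
        ("Output directory", "output_dir"),
        ("Paragraphs", "paragraphs_count"),
        ("Tables", "tables_count"),
        ("Styles", "styles_count"),
    ]
    parts = [f"{label}: {result.get(key, 'n/a')}" for label, key in fields]
    return " | ".join(parts)
```

For example, `summarize_result(response.json())` could replace the individual print statements in the Python example.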
DOCX Studio Examples
Visual interface for interactive extraction
Getting Started in DOCX Studio Extraction
- Navigate to the DOCX Studio Extraction tab
- Upload a Word document:
  - Click or drag and drop a .docx or .doc file (max 50 MB)
  - Single and multiple file uploads are supported
- Select an Extraction Mode:
  - Detailed - Full extraction with all properties
  - Simple - Basic extraction with minimal information
  - Metadata - Document properties and statistics only
  - Content - Primary text content and structure
  - Tables Only - Extract only table data
  - All - Complete extraction of everything
- Choose Output Format: JSON, YAML, TOML, or CSV (affects file structure and readability)
- Click Extract: Process your file with the selected options
- Download Results: Individual files, a summary, or a complete ZIP archive
- View API Examples: Switch to the API tab to see ready-to-use code in your preferred language
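Once a complete ZIP archive has been downloaded, its contents can be inspected without unpacking it. The sketch below groups archive entries by folder. It assumes the archive mirrors the `extracted_data/` layout shown in the cURL example, which is not confirmed here, and the function name `group_archive_entries` is hypothetical.

```python
import zipfile
from collections import defaultdict


def group_archive_entries(archive) -> dict:
    """Group file names in a results ZIP by their first folder below the root.

    `archive` is a path or a binary file-like object. The layout
    (extracted_data/content/..., extracted_data/tables/...) is an assumption
    based on the output structure shown in the cURL example.
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            parts = name.split("/")
            # parts[0] is the root folder; parts[1] is content, tables, metadata, or styles
            folder = parts[1] if len(parts) > 2 else "(root)"
            groups[folder].append(parts[-1])
    return {folder: sorted(files) for folder, files in groups.items()}
```

Calling `group_archive_entries("results.zip")` on such an archive would return a dictionary keyed by folder name, which makes it easy to confirm that the expected table or metadata files are present.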
Studio Interface Components
- File Upload Section: Drag-and-drop or click to upload Word documents (DOCX, DOC)
- Extraction Mode Selector: 6 mode buttons for different extraction scenarios
- Output Format Buttons: Choose between JSON, YAML, TOML, or CSV formats
- API Code Examples Tab: Live code examples in cURL, Python, JavaScript, Java, C#, and Go
- Real-time API Updates: Code examples automatically update as you modify settings
Tips & Tricks
For Text Extraction
- Use Detailed mode for full formatting information
- Content mode is the fastest option when you only need plain text
- Heading hierarchy is preserved in the output structure
- Use JSON format for programmatic processing
For Table Extraction
- Use Tables Only mode when you only need table data
- CSV format works well for spreadsheet import
- Merged cells are handled automatically
- Column widths are preserved in metadata
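Tables exported as CSV can be read with Python's standard `csv` module. This is a minimal sketch that assumes the first row of the table holds column headers, which will not be true of every table in a document. The helper names `load_table` and `column` are hypothetical.

```python
import csv
import io


def load_table(csv_text: str) -> list:
    """Parse one extracted table into a list of row dicts keyed by the header row.

    Assumes the first row contains column headers.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


def column(rows: list, name: str) -> list:
    """Return every value of one column, in row order (values stay strings)."""
    return [row[name] for row in rows]
```

To load a file from the output structure, read it first, for example with `open("extracted_data/tables/table_1.csv", newline="", encoding="utf-8")`, and pass the text to `load_table`.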
For Metadata Extraction
- Metadata mode suits document cataloging systems
- Output includes word count, page count, and other statistics
- Custom properties are extracted as well
- Revision history provides change tracking
Performance Tips
- Large documents may take longer to process
- Use Simple mode for faster extraction
- Tables Only mode skips text processing
- Batch operations are more efficient than processing files one at a time
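On the client side, several documents can be submitted concurrently with a small thread pool. The sketch below is one way to do this under stated assumptions: `extract_one` is a caller-supplied function (hypothetical here) that uploads a single document and returns the parsed response, and the service's limits on concurrent requests are not documented in this guide, so `max_workers` should be kept modest.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_many(paths, extract_one, max_workers: int = 4) -> dict:
    """Run `extract_one(path)` for each path on a small thread pool.

    `extract_one` is supplied by the caller (hypothetical; for example a
    wrapper around the upload request). Failures are captured per file so
    that one bad document does not abort the whole batch.
    """
    def safe_call(path):
        try:
            return path, {"ok": True, "result": extract_one(path)}
        except Exception as exc:  # record the error for this file and keep going
            return path, {"ok": False, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order; dict() keys the outcomes by path
        return dict(pool.map(safe_call, paths))
```

Each entry in the returned dictionary records either the parsed result or the error message for that file, which gives a simple form of per-file progress tracking.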
Next Steps
What you can do with extracted data
After Extraction, You Can:
- Analyze Data: Use extracted tables and content for data analysis
- Generate New Documents: Create Word documents from extracted data using our Generate feature
- Archive Content: Store extracted text, tables, and metadata in your organization's archive
- Share Data: Export extracted data in various formats for team sharing