📄 Extract: Word Document Content Extraction

Tools for analyzing Word documents and extracting their text, tables, metadata, styles, and other content. Suited to data analysis, content migration, and automation workflows.

ℹ️ Overview

Understanding Word document extraction capabilities

What is Word Document Extraction?

Extraction allows you to analyze and extract structured data from Word documents. DOCX Studio provides two powerful interfaces:

REST API

For developers to programmatically extract data

  • Automated extraction workflows
  • Multiple language support
  • System integration ready
  • Batch processing capability
DOCX Studio

Visual interface for interactive extraction

  • Drag-and-drop file upload
  • Interactive data preview
  • One-click downloads
  • No coding required
Key Capabilities
  • 6 Extraction Modes: Flexible options from simple to detailed with content and metadata variations
  • Content Filtering: Filter by paragraphs, tables, headers, footers, footnotes, and more
  • Multiple Output Formats: JSON, YAML, TOML, CSV for different use cases
  • Style Extraction: Extract document styles, fonts, and formatting information
  • Table Extraction: Preserve table structure with merged cell handling
  • Batch Processing: Extract multiple content types in one operation
  • API Parity: All studio features available via REST API with 6 language examples

⭐ Extraction Features

📝

Text Extraction

Full document text with formatting preservation

  • Paragraphs with style information
  • Heading hierarchy detection
  • Font, size, and color preservation
  • Bold, italic, underline tracking
  • Structured output by section
📋

Table Extraction

Preserve structure and formatting of tables

  • Cell data and formatting
  • Merged cells handling
  • Column width preservation
  • Row and column spans
  • CSV and Excel export
🎨

Style Extraction

Capture document styles and themes

  • Named styles and custom styles
  • Font families and sizes
  • Color schemes and themes
  • Paragraph spacing and alignment
  • List and numbering formats
📊

Metadata Extraction

Document properties and statistics

  • Author and creation date
  • Word count and page count
  • Document revision history
  • Custom document properties
  • File size and format details
🔗

Content Analysis

Deep analysis of document structure

  • Section and heading structure
  • Hyperlink extraction
  • Cross-reference mapping
  • Table of contents analysis
  • Footnote and endnote extraction
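
Heading hierarchy detection yields a nested section structure. As a rough sketch of what a consumer could do with it, the snippet below nests a flat list of headings into a tree. The input shape (pairs of heading level and text, with levels starting at 1) and the `build_outline` helper are assumptions made for illustration; the actual schema of the extracted heading data is not documented here.

```python
def build_outline(headings):
    """Nest a flat list of (level, text) headings into a tree of dicts.

    Assumes heading levels are integers >= 1 (hypothetical input shape).
    """
    root = {"level": 0, "text": None, "children": []}
    stack = [root]
    for level, text in headings:
        node = {"level": level, "text": text, "children": []}
        # Pop until the top of the stack is a shallower heading (the parent).
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


outline = build_outline([(1, "Intro"), (2, "Scope"), (2, "Terms"), (1, "Method")])
for section in outline:
    print(section["text"], [child["text"] for child in section["children"]])
```

A stack keeps the current chain of open sections, so each heading attaches to the nearest preceding heading with a smaller level.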
⚡

Batch Processing

Process multiple documents efficiently

  • Multi-file upload support
  • Consistent output structure
  • Parallel processing capability
  • Combined result downloads
  • Progress tracking per file
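
Parallel processing with per-file results can also be approximated on the client side. The sketch below is one possible approach, not the service's own batch mechanism: it assumes one request per file and takes the upload step as a caller-supplied function, so `extract_many`, `extract_one`, and `fake_extract` are all hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_many(paths, extract_one, max_workers=4):
    """Run extract_one(path) for each path in parallel.

    Returns a dict mapping each path to ("ok", result) or ("error", message),
    so one failing file does not abort the rest of the batch.
    """
    def safe(path):
        try:
            return path, ("ok", extract_one(path))
        except Exception as exc:  # report per-file failures instead of raising
            return path, ("error", str(exc))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(safe, paths))


# Stand-in uploader for demonstration; a real one would POST the file to the API.
def fake_extract(path):
    if not path.endswith(".docx"):
        raise ValueError("unsupported file type")
    return {"file": path, "tables_count": 0}


results = extract_many(["a.docx", "b.txt"], fake_extract)
for path, (status, detail) in results.items():
    print(path, status, detail)
```

Passing the uploader in as a function keeps the concurrency logic independent of any particular HTTP client.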

🔧 Extraction Modes

Choose how much data to extract based on your needs

Mode 1: Detailed

Full extraction with complete document properties, including all paragraphs, tables, styles, metadata, and formatting information.

Output: Complete structured data with all properties
Best for: Complete content analysis, document migration, full reconstruction
Mode 2: Simple

Basic extraction with essential information only - just the core text and tables without extended properties.

Output: Essential data only
Best for: Quick analysis, reducing file size, core content only
Mode 3: Metadata

Extract only document metadata and properties - author, dates, statistics, and custom properties.

Output: Document properties and statistics
Best for: Document cataloging, compliance checks, audit trails
Mode 4: Content

Extract primary content - paragraphs, headings, and text elements without deep formatting details.

Output: Text content with basic structure
Best for: Content migration, text analysis, search indexing
Mode 5: Tables Only

Extract only table data from the document, preserving structure and cell values.

Output: Table data in structured format
Best for: Data extraction, spreadsheet conversion, tabular analysis
Mode 6: All

Complete extraction of everything - text, tables, metadata, styles, images, headers, footers, and all document elements.

Output: Everything in the document
Best for: Full archival, complete document reconstruction, comprehensive analysis
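
When calling the API, each mode is selected through the `mode` form field. A minimal sketch of validating the request parameters before sending them, assuming the accepted values are exactly those listed in the comments of the API examples (`detailed`, `simple`, `metadata`, `content`, `tables`, `all` for the mode; `json`, `yaml`, `toml`, `csv` for the format). The helper `build_extract_params` is hypothetical and not part of any official client.

```python
# Values taken from the comments in the API examples; treat them as
# assumptions until confirmed against the live API.
VALID_MODES = ("detailed", "simple", "metadata", "content", "tables", "all")
VALID_FORMATS = ("json", "yaml", "toml", "csv")


def build_extract_params(mode="detailed", file_format="json"):
    """Return the form fields for an extract request, or raise ValueError."""
    mode = mode.strip().lower()
    file_format = file_format.strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {VALID_MODES}")
    if file_format not in VALID_FORMATS:
        raise ValueError(
            f"unknown format {file_format!r}; expected one of {VALID_FORMATS}"
        )
    return {"mode": mode, "file_format": file_format}


print(build_extract_params("Tables", "CSV"))
```

Checking the values locally catches typos such as `table` or `yml` before a file is uploaded.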

💻 API Usage Examples

Learn how to extract content from Word documents using the API in your preferred programming language

Extract Document Examples

cURL

# Extract with detailed mode, organized by content type
curl -X POST https://powerfile.io/docx/api/extract \
  -H "Authorization: YOUR_API_TOKEN" \
  -F "file=@document.docx" \
  -F "mode=detailed" \
  -F "file_format=json"

# Output structure: extracted_data/
#   ├── content/
#   │   ├── paragraphs.json
#   │   ├── headings.json
#   │   └── full_text.txt
#   ├── tables/
#   │   ├── table_metadata.json
#   │   ├── table_1.csv
#   │   └── table_2.csv
#   ├── metadata/
#   │   ├── document_properties.json
#   │   └── statistics.json
#   └── styles/
#       ├── styles.json
#       └── fonts.json

Python

import requests

# Configure API endpoint and authentication
url = "https://powerfile.io/docx/api/extract"
headers = {"Authorization": "YOUR_API_TOKEN"}

# Open and upload the Word document
with open("document.docx", "rb") as f:
    files = {"file": ("document.docx", f, "application/vnd.openxmlformats-officedocument.wordprocessingml.document")}
    data = {
        "mode": "detailed",              # detailed, simple, metadata, content, tables, all
        "file_format": "json",           # json, yaml, toml, csv
    }
    
    response = requests.post(url, headers=headers, files=files, data=data)
    
    if response.status_code == 200:
        result = response.json()
        print("✅ Extraction successful!")
        print(f"Output directory: {result['output_dir']}")
        print(f"Paragraphs extracted: {result['paragraphs_count']}")
        print(f"Tables extracted: {result['tables_count']}")
        print(f"Styles found: {result['styles_count']}")
    else:
        print(f"❌ Error: {response.status_code}")
        print(response.json())

JavaScript

const fs = require('fs');
const FormData = require('form-data');
const axios = require('axios');

// Configure API endpoint and authentication
const url = 'https://powerfile.io/docx/api/extract';
const apiKey = 'YOUR_API_TOKEN';

// Create form data with file and parameters
const form = new FormData();
form.append('file', fs.createReadStream('document.docx'));
form.append('mode', 'detailed');              // detailed, simple, metadata, content, tables, all
form.append('file_format', 'json');           // json, yaml, toml, csv

// Make the API request
axios.post(url, form, {
  headers: {
    ...form.getHeaders(),
    'Authorization': apiKey
  }
}).then(response => {
  console.log('✅ Extraction successful!');
  console.log('Output directory:', response.data.output_dir);
  console.log('Paragraphs:', response.data.paragraphs_count);
  console.log('Tables:', response.data.tables_count);
  console.log('Styles:', response.data.styles_count);
}).catch(error => {
  console.error('❌ Error:', error.response?.data || error.message);
});

Java

import java.io.File;
import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.mime.MultipartEntityBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

public class ExtractExample {
    public static void main(String[] args) {
        // Configure API endpoint and authentication
        HttpPost uploadFile = new HttpPost("https://powerfile.io/docx/api/extract");
        uploadFile.setHeader("Authorization", "YOUR_API_TOKEN");

        // Build multipart form data
        MultipartEntityBuilder builder = MultipartEntityBuilder.create();
        builder.addBinaryBody(
            "file",
            new File("document.docx"),
            ContentType.create("application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
            "document.docx"
        );
        builder.addTextBody("mode", "detailed");              // detailed, simple, metadata, content, tables, all
        builder.addTextBody("file_format", "json");           // json, yaml, toml, csv

        HttpEntity multipart = builder.build();
        uploadFile.setEntity(multipart);

        // Execute request; try-with-resources closes both the client and the response
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(uploadFile)) {
            String responseBody = EntityUtils.toString(response.getEntity());

            if (response.getStatusLine().getStatusCode() == 200) {
                JsonObject result = JsonParser.parseString(responseBody).getAsJsonObject();
                System.out.println("✅ Extraction successful!");
                System.out.println("Output: " + result.get("output_dir").getAsString());
                System.out.println("Paragraphs: " + result.get("paragraphs_count").getAsInt());
                System.out.println("Tables: " + result.get("tables_count").getAsInt());
            } else {
                System.err.println("❌ Error: " + response.getStatusLine());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

C#

using System;
using System.IO;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

// Configure API endpoint and authentication
var client = new HttpClient();
client.DefaultRequestHeaders.Add("Authorization", "YOUR_API_TOKEN");

// Create multipart form content
var content = new MultipartFormDataContent();
using (var fileStream = File.OpenRead("document.docx"))
{
    var streamContent = new StreamContent(fileStream);
    streamContent.Headers.ContentType = new System.Net.Http.Headers.MediaTypeHeaderValue(
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    );
    content.Add(streamContent, "file", "document.docx");
    content.Add(new StringContent("detailed"), "mode");               // detailed, simple, metadata, content, tables, all
    content.Add(new StringContent("json"), "file_format");            // json, yaml, toml, csv

    // Make the API request
    var response = await client.PostAsync("https://powerfile.io/docx/api/extract", content);
    var responseBody = await response.Content.ReadAsStringAsync();
    
    if (response.IsSuccessStatusCode)
    {
        // Deserialize into a JsonElement so properties can be read by name
        var result = JsonSerializer.Deserialize<JsonElement>(responseBody);
        Console.WriteLine("✅ Extraction successful!");
        Console.WriteLine("Output directory: " + result.GetProperty("output_dir").GetString());
        Console.WriteLine("Paragraphs: " + result.GetProperty("paragraphs_count").GetInt32());
        Console.WriteLine("Tables: " + result.GetProperty("tables_count").GetInt32());
        Console.WriteLine("Styles: " + result.GetProperty("styles_count").GetInt32());
    }
    else
    {
        Console.WriteLine($"❌ Error: {response.StatusCode}");
        Console.WriteLine(responseBody);
    }
}

Go

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
)

func main() {
	// Open the Word document
	file, err := os.Open("document.docx")
	if err != nil {
		panic(err)
	}
	defer file.Close()

	// Create multipart form data
	body := &bytes.Buffer{}
	writer := multipart.NewWriter(body)
	part, err := writer.CreateFormFile("file", "document.docx")
	if err != nil {
		panic(err)
	}
	if _, err := io.Copy(part, file); err != nil {
		panic(err)
	}
	
	// Add form fields
	writer.WriteField("mode", "detailed")              // detailed, simple, metadata, content, tables, all
	writer.WriteField("file_format", "json")          // json, yaml, toml, csv
	writer.Close()

	// Create and configure request
	req, err := http.NewRequest("POST", "https://powerfile.io/docx/api/extract", body)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "YOUR_API_TOKEN")
	req.Header.Set("Content-Type", writer.FormDataContentType())

	// Execute request
	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Parse response
	var result map[string]interface{}
	json.NewDecoder(resp.Body).Decode(&result)
	
	if resp.StatusCode == 200 {
		fmt.Println("✅ Extraction successful!")
		fmt.Println("Output:", result["output_dir"])
		fmt.Println("Paragraphs:", result["paragraphs_count"])
		fmt.Println("Tables:", result["tables_count"])
		fmt.Println("Styles:", result["styles_count"])
	} else {
		fmt.Printf("❌ Error: %d\n", resp.StatusCode)
	}
}
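
The output layout shown in the cURL example comments (JSON files plus one CSV per table) can be read back with standard-library tools. The sketch below assumes the results have already been downloaded and unpacked into a local folder such as `extracted_data/`; the `load_extracted` helper is hypothetical, and because the schema of each JSON file is not documented here, the loader stays generic and simply parses whatever it finds.

```python
import csv
import json
from pathlib import Path


def load_extracted(root):
    """Collect JSON documents and CSV tables found under an output directory."""
    root = Path(root)
    loaded = {"json": {}, "tables": {}}
    for path in sorted(root.rglob("*.json")):
        # Key each file by its path relative to the root, e.g. "metadata/statistics.json"
        key = path.relative_to(root).as_posix()
        loaded["json"][key] = json.loads(path.read_text(encoding="utf-8"))
    for path in sorted(root.rglob("*.csv")):
        key = path.relative_to(root).as_posix()
        with path.open(newline="", encoding="utf-8") as handle:
            loaded["tables"][key] = list(csv.reader(handle))
    return loaded
```

Calling `load_extracted("extracted_data")` returns parsed JSON under `"json"` and each table as a list of rows under `"tables"`, keyed by relative path.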

🎨 DOCX Studio Examples

Visual interface for interactive extraction

Getting Started in DOCX Studio Extraction
  1. Navigate to the DOCX Studio Extraction tab
  2. Upload Word document:
    • Click or drag and drop a .docx or .doc file (Max 50MB)
    • Supports single or multiple file uploads
  3. Select an Extraction Mode:
    • Detailed - Full extraction with all properties
    • Simple - Basic extraction with minimal information
    • Metadata - Document properties and statistics only
    • Content - Primary text content and structure
    • Tables Only - Extract only table data
    • All - Complete extraction of everything
  4. Choose Output Format: JSON, YAML, TOML, or CSV (impacts file structure and readability)
  5. Click Extract: Process your file with selected options
  6. Download Results: Individual files, summary, or complete ZIP archive
  7. View API Examples: Switch to the API tab to see ready-to-use code in your preferred language
Studio Interface Components
  • File Upload Section: Drag-and-drop or click to upload Word documents (DOCX, DOC)
  • Extraction Mode Selector: 6 mode buttons for different extraction scenarios
  • Output Format Buttons: Choose between JSON, YAML, TOML, or CSV formats
  • API Code Examples Tab: Live code examples in cURL, Python, JavaScript, Java, C#, and Go
  • Real-time API Updates: Code examples automatically update as you modify settings

💡 Tips & Tricks

📝 For Text Extraction
  • Use Detailed mode for full formatting info
  • Content mode is fastest for plain text needs
  • Heading hierarchy is preserved in the output structure
  • Use JSON format for programmatic processing
📋 For Table Extraction
  • Use Tables Only mode for focused table data
  • CSV format works great for spreadsheet import
  • Merged cells are handled automatically
  • Column widths are preserved in metadata
📊 For Metadata Extraction
  • Great for document cataloging systems
  • Includes word count, page count, and more
  • Custom properties are extracted too
  • Revision history provides change tracking
⚡ Performance Tips
  • Large documents may take longer to process
  • Use Simple mode for faster extraction
  • Tables Only mode skips text processing
  • Batch operations are more efficient than processing files one at a time

🚀 Next Steps

What you can do with extracted data

After Extraction, You Can:
  • Analyze Data: Use extracted tables and content for data analysis
  • Generate New Documents: Create Word documents from extracted data using our Generate feature
  • Archive Content: Store extracted text, tables, and metadata in your organization's archive
  • Share Data: Export extracted data in various formats for team sharing