PowerFile Web Portal

ℹ️ Overview

Understanding Word document extraction capabilities

What is Word Document Extraction?

Extraction lets you analyze Word documents and extract structured data from them. Two interfaces are available:

REST API

For developers to programmatically extract data

Automated extraction workflows
Multiple language support
System integration ready
Batch processing capability

DOCX Studio

Visual interface for interactive extraction

Drag-and-drop file upload
Interactive data preview
One-click downloads
No coding required

Key Capabilities

6 Extraction Modes: Flexible options from simple to detailed with content and metadata variations
Content Filtering: Filter by paragraphs, tables, headers, footers, footnotes, and more
Multiple Output Formats: JSON, YAML, TOML, CSV for different use cases
Style Extraction: Extract document styles, fonts, and formatting information
Table Extraction: Preserve table structure with merged cell handling
Batch Processing: Extract multiple content types in one operation
API Parity: All studio features available via REST API with 6 language examples

⭐ Extraction Features

📝 Text Extraction

Full document text with formatting preservation

Paragraphs with style information
Heading hierarchy detection
Font, size, and color preservation
Bold, italic, underline tracking
Structured output by section

📋 Table Extraction

Preserve structure and formatting of tables

Cell data and formatting
Merged cells handling
Column width preservation
Row and column spans
CSV and Excel export

🎨 Style Extraction

Capture document styles and themes

Named styles and custom styles
Font families and sizes
Color schemes and themes
Paragraph spacing and alignment
List and numbering formats

📊 Metadata Extraction

Document properties and statistics

Author and creation date
Word count and page count
Document revision history
Custom document properties
File size and format details

🔗 Content Analysis

Deep analysis of document structure

Section and heading structure
Hyperlink extraction
Cross-reference mapping
Table of contents analysis
Footnote and endnote extraction

⚡ Batch Processing

Process multiple documents efficiently

Multi-file upload support
Consistent output structure
Parallel processing capability
Combined result downloads
Progress tracking per file

🔧 Extraction Modes

Choose how much data to extract based on your needs

Mode 1: Detailed

Full extraction with complete document properties, including all paragraphs, tables, styles, metadata, and formatting information.

Output: Complete structured data with all properties
Best for: Complete content analysis, document migration, full reconstruction

Mode 2: Simple

Basic extraction with essential information only: the core text and tables, without extended properties.

Output: Essential data only
Best for: Quick analysis, reducing file size, core content only

Mode 3: Metadata

Extract only document metadata and properties: author, dates, statistics, and custom properties.

Output: Document properties and statistics
Best for: Document cataloging, compliance checks, audit trails

Mode 4: Content

Extract primary content (paragraphs, headings, and text elements) without deep formatting details.

Output: Text content with basic structure
Best for: Content migration, text analysis, search indexing

Mode 5: Tables Only

Extract only table data from the document, preserving structure and cell values.

Output: Table data in structured format
Best for: Data extraction, spreadsheet conversion, tabular analysis

Mode 6: All

Complete extraction of everything: text, tables, metadata, styles, images, headers, footers, and all other document elements.

Output: Everything in the document
Best for: Full archival, complete document reconstruction, comprehensive analysis
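
The six modes correspond to the `mode` form field sent with an extract request. As a minimal sketch, a client could validate its inputs before uploading anything. This assumes the accepted values are exactly those listed in the comments of this page's API examples; `build_form_fields` is a hypothetical helper, not part of the PowerFile API.

```python
# Values taken from the comments in this page's API examples; treat them as
# assumptions until confirmed against the live API.
VALID_MODES = ("detailed", "simple", "metadata", "content", "tables", "all")
VALID_FORMATS = ("json", "yaml", "toml", "csv")


def build_form_fields(mode: str = "detailed", file_format: str = "json") -> dict:
    """Return the form fields for an extract request, or raise ValueError."""
    mode = mode.strip().lower()
    file_format = file_format.strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {VALID_MODES}")
    if file_format not in VALID_FORMATS:
        raise ValueError(
            f"unknown format {file_format!r}; expected one of {VALID_FORMATS}"
        )
    return {"mode": mode, "file_format": file_format}
```

The returned dictionary can be passed as the `data` argument of a multipart POST. Note that the studio label "Tables Only" corresponds to the value `tables` in the examples.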

💻 API Usage Examples

Learn how to extract content from Word documents using the API in your preferred programming language

Extract Document Examples

cURL

# Extract with detailed mode, organized by content type
curl -X POST https://powerfile.io/docx/api/extract \
  -H "Authorization: YOUR_API_TOKEN" \
  -F "file=@document.docx" \
  -F "mode=detailed" \
  -F "file_format=json"

# Output structure: extracted_data/
#   ├── content/
#   │   ├── paragraphs.json
#   │   ├── headings.json
#   │   └── full_text.txt
#   ├── tables/
#   │   ├── table_metadata.json
#   │   ├── table_1.csv
#   │   └── table_2.csv
#   ├── metadata/
#   │   ├── document_properties.json
#   │   └── statistics.json
#   └── styles/
#       ├── styles.json
#       └── fonts.json
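
The output tree above can be inspected with a few lines of Python. This is only a sketch: it assumes the results have already been downloaded and unpacked into a local folder laid out as shown (how the output is delivered is not specified here), and `summarize_output` is a hypothetical helper name.

```python
from pathlib import Path


def summarize_output(root: str) -> dict:
    """Map each subfolder of the output directory to its sorted file names."""
    summary = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            summary[folder.name] = sorted(
                p.name for p in folder.iterdir() if p.is_file()
            )
    return summary
```

Calling `summarize_output("extracted_data")` on a tree like the one above would return a dictionary keyed by `content`, `metadata`, `styles`, and `tables`.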

Python

import requests

# Configure API endpoint and authentication
url = "https://powerfile.io/docx/api/extract"
headers = {"Authorization": "YOUR_API_TOKEN"}

# Open and upload the Word document
with open("document.docx", "rb") as f:
    files = {"file": ("document.docx", f, "application/vnd.openxmlformats-officedocument.wordprocessingml.document")}
    data = {
        "mode": "detailed",              # detailed, simple, metadata, content, tables, all
        "file_format": "json",           # json, yaml, toml, csv
    }
    
    response = requests.post(url, headers=headers, files=files, data=data)
    
    if response.status_code == 200:
        result = response.json()
        print("✅ Extraction successful!")
        print(f"Output directory: {result['output_dir']}")
        print(f"Paragraphs extracted: {result['paragraphs_count']}")
        print(f"Tables extracted: {result['tables_count']}")
        print(f"Styles found: {result['styles_count']}")
    else:
        print(f"❌ Error: {response.status_code}")
        print(response.text)  # error bodies are not guaranteed to be JSON

JavaScript

const fs = require('fs');
const FormData = require('form-data');
const axios = require('axios');

// Configure API endpoint and authentication
const url = 'https://powerfile.io/docx/api/extract';
const apiKey = 'YOUR_API_TOKEN';

// Create form data with file and parameters
const form = new FormData();
form.append('file', fs.createReadStream('document.docx'));
form.append('mode', 'detailed');              // detailed, simple, metadata, content, tables, all
form.append('file_format', 'json');           // json, yaml, toml, csv

// Make the API request
axios.post(url, form, {
  headers: {
    ...form.getHeaders(),
    'Authorization': apiKey
  }
}).then(response => {
  console.log('✅ Extraction successful!');
  console.log('Output directory:', response.data.output_dir);
  console.log('Paragraphs:', response.data.paragraphs_count);
  console.log('Tables:', response.data.tables_count);
  console.log('Styles:', response.data.styles_count);
}).catch(error => {
  console.error('❌ Error:', error.response?.data || error.message);
});

Java

import java.io.File;
import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.mime.MultipartEntityBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

// Configure API endpoint and authentication
CloseableHttpClient httpClient = HttpClients.createDefault();
HttpPost uploadFile = new HttpPost("https://powerfile.io/docx/api/extract");
uploadFile.setHeader("Authorization", "YOUR_API_TOKEN");

// Build multipart form data
MultipartEntityBuilder builder = MultipartEntityBuilder.create();
builder.addBinaryBody(
    "file", 
    new File("document.docx"),
    ContentType.create("application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
    "document.docx"
);
builder.addTextBody("mode", "detailed");              // detailed, simple, metadata, content, tables, all
builder.addTextBody("file_format", "json");          // json, yaml, toml, csv

HttpEntity multipart = builder.build();
uploadFile.setEntity(multipart);

// Execute request
try (CloseableHttpResponse response = httpClient.execute(uploadFile)) {
    String responseBody = EntityUtils.toString(response.getEntity());
    
    if (response.getStatusLine().getStatusCode() == 200) {
        // Parse only on success; an error body may not be valid JSON
        JsonObject result = JsonParser.parseString(responseBody).getAsJsonObject();
        System.out.println("✅ Extraction successful!");
        System.out.println("Output: " + result.get("output_dir").getAsString());
        System.out.println("Paragraphs: " + result.get("paragraphs_count").getAsInt());
        System.out.println("Tables: " + result.get("tables_count").getAsInt());
        System.out.println("Styles: " + result.get("styles_count").getAsInt());
    } else {
        System.err.println("❌ Error: " + response.getStatusLine());
        System.err.println(responseBody);
    }
} catch (IOException e) {
    e.printStackTrace();
}

C#

using System;
using System.IO;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

// Configure API endpoint and authentication
var client = new HttpClient();
client.DefaultRequestHeaders.Add("Authorization", "YOUR_API_TOKEN");

// Create multipart form content
var content = new MultipartFormDataContent();
using (var fileStream = File.OpenRead("document.docx"))
{
    var streamContent = new StreamContent(fileStream);
    streamContent.Headers.ContentType = new System.Net.Http.Headers.MediaTypeHeaderValue(
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    );
    content.Add(streamContent, "file", "document.docx");
    content.Add(new StringContent("detailed"), "mode");               // detailed, simple, metadata, content, tables, all
    content.Add(new StringContent("json"), "file_format");            // json, yaml, toml, csv

    // Make the API request
    var response = await client.PostAsync("https://powerfile.io/docx/api/extract", content);
    var responseBody = await response.Content.ReadAsStringAsync();
    
    if (response.IsSuccessStatusCode)
    {
        // Deserialize into a JsonElement so properties can be read by name
        var result = JsonSerializer.Deserialize<JsonElement>(responseBody);
        Console.WriteLine("✅ Extraction successful!");
        Console.WriteLine($"Output directory: {result.GetProperty("output_dir").GetString()}");
        Console.WriteLine($"Paragraphs: {result.GetProperty("paragraphs_count").GetInt32()}");
        Console.WriteLine($"Tables: {result.GetProperty("tables_count").GetInt32()}");
        Console.WriteLine($"Styles: {result.GetProperty("styles_count").GetInt32()}");
    }
    else
    {
        Console.WriteLine($"❌ Error: {response.StatusCode}");
        Console.WriteLine(responseBody);
    }
}

Go

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
)

func main() {
	// Open the Word document
	file, err := os.Open("document.docx")
	if err != nil {
		panic(err)
	}
	defer file.Close()

	// Create multipart form data
	body := &bytes.Buffer{}
	writer := multipart.NewWriter(body)
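	part, err := writer.CreateFormFile("file", "document.docx")
	if err != nil {
		panic(err)
	}
	// Stream the document into the multipart body
	if _, err := io.Copy(part, file); err != nil {
		panic(err)
	}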
	
	// Add form fields
	writer.WriteField("mode", "detailed")              // detailed, simple, metadata, content, tables, all
	writer.WriteField("file_format", "json")          // json, yaml, toml, csv
	writer.Close()

	// Create and configure request
	req, _ := http.NewRequest("POST", "https://powerfile.io/docx/api/extract", body)
	req.Header.Set("Authorization", "YOUR_API_TOKEN")
	req.Header.Set("Content-Type", writer.FormDataContentType())

	// Execute request
	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Parse response
	var result map[string]interface{}
	json.NewDecoder(resp.Body).Decode(&result)
	
	if resp.StatusCode == 200 {
		fmt.Println("✅ Extraction successful!")
		fmt.Println("Output:", result["output_dir"])
		fmt.Println("Paragraphs:", result["paragraphs_count"])
		fmt.Println("Tables:", result["tables_count"])
		fmt.Println("Styles:", result["styles_count"])
	} else {
		fmt.Printf("❌ Error: %d\n", resp.StatusCode)
	}
}
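
All of the examples above read the same summary fields from a successful response: `output_dir`, `paragraphs_count`, `tables_count`, and `styles_count`. A small sketch of a tolerant reader follows. It assumes the response body is a flat JSON object carrying those keys, as the examples imply; `format_summary` is a hypothetical helper.

```python
SUMMARY_KEYS = ("output_dir", "paragraphs_count", "tables_count", "styles_count")


def format_summary(result: dict) -> str:
    """Render the summary fields of an extract response, one per line."""
    lines = []
    for key in SUMMARY_KEYS:
        value = result.get(key, "n/a")  # tolerate fields the server omits
        lines.append(f"{key}: {value}")
    return "\n".join(lines)
```

Using `.get` with a default keeps the client from raising `KeyError` if a given mode does not report every count.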

🎨 DOCX Studio Examples

Visual interface for interactive extraction

Getting Started in DOCX Studio Extraction

1. Navigate to DOCX Studio Extraction Tab
2. Upload Word document:
- Click or drag and drop a .docx or .doc file (Max 50MB)
- Supports single or multiple file uploads
3. Select an Extraction Mode:
- Detailed - Full extraction with all properties
- Simple - Basic extraction with minimal information
- Metadata - Document properties and statistics only
- Content - Primary text content and structure
- Tables Only - Extract only table data
- All - Complete extraction of everything
4. Choose Output Format: JSON, YAML, TOML, or CSV (impacts file structure and readability)
5. Click Extract: Process your file with selected options
6. Download Results: Individual files, summary, or complete ZIP archive
7. View API Examples: Switch to the API tab to see ready-to-use code in your preferred language

Studio Interface Components

File Upload Section: Drag-and-drop or click to upload Word documents (DOCX, DOC)
Extraction Mode Selector: 6 mode buttons for different extraction scenarios
Output Format Buttons: Choose between JSON, YAML, TOML, or CSV formats
API Code Examples Tab: Live code examples in cURL, Python, JavaScript, Java, C#, and Go
Real-time API Updates: Code examples automatically update as you modify settings

💡 Tips & Tricks

📝 For Text Extraction

Use Detailed mode for full formatting info
Content mode is fastest for plain text needs
Heading hierarchy is preserved in the output structure
Use JSON format for programmatic processing

📋 For Table Extraction

Use Tables Only mode for focused table data
CSV format works well for spreadsheet import
Merged cells are handled automatically
Column widths are preserved in metadata
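
As a rough illustration of working with CSV table output, the sketch below loads every table file from a folder into memory. It assumes the files are named `table_1.csv`, `table_2.csv`, and so on (as in the output structure shown in the cURL example) and are UTF-8 encoded; `load_tables` is a hypothetical helper.

```python
import csv
from pathlib import Path


def load_tables(tables_dir: str) -> dict:
    """Read every table_*.csv in tables_dir into a list of rows."""
    tables = {}
    for path in sorted(Path(tables_dir).glob("table_*.csv")):
        # newline="" lets the csv module handle embedded line breaks in cells
        with path.open(newline="", encoding="utf-8") as handle:
            tables[path.stem] = list(csv.reader(handle))
    return tables
```

Each value is a list of rows, and each row is a list of cell strings, which is straightforward to hand to a spreadsheet or data-analysis library.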

📊 For Metadata Extraction

Great for document cataloging systems
Includes word count, page count, and more
Custom properties are extracted too
Revision history provides change tracking
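
For a cataloging system, the metadata files can be flattened into one record per document. The sketch below assumes the two files named in the cURL example's output structure (`document_properties.json` and `statistics.json`) each contain a single JSON object; the keys inside them are not documented here, so the code makes no assumptions about them. `catalog_record` is a hypothetical helper.

```python
import json
from pathlib import Path


def catalog_record(metadata_dir: str, source_name: str) -> dict:
    """Merge the metadata JSON files into one flat record for a catalog."""
    record = {"source": source_name}
    for name in ("document_properties.json", "statistics.json"):
        path = Path(metadata_dir) / name
        if path.exists():
            data = json.loads(path.read_text(encoding="utf-8"))
            # Prefix keys with the file stem so the two files cannot collide.
            for key, value in data.items():
                record[f"{path.stem}.{key}"] = value
    return record
```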

⚡ Performance Tips

Large documents may take longer to process
Use Simple mode for faster extraction
Tables Only mode skips text processing
Batch operations are more efficient
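
One way to approximate batch processing from the client side is to fan out single-file requests in parallel. This page does not say whether the API accepts several files in one request, so the sketch below makes no such assumption: `extract_one` is a hypothetical callable that uploads one path (for example, by wrapping the Python request shown earlier) and returns the parsed response.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_many(paths, extract_one, max_workers=4):
    """Run extract_one(path) for each path in parallel; collect results and errors."""
    results, errors = {}, {}

    def run(path):
        try:
            return path, extract_one(path), None
        except Exception as exc:  # keep going if one file fails
            return path, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for path, result, exc in pool.map(run, paths):
            if exc is None:
                results[path] = result
            else:
                errors[path] = exc
    return results, errors
```

Threads suit this workload because each call spends most of its time waiting on the network. Keep `max_workers` modest unless the service's rate limits are known.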

🚀 Next Steps

What you can do with extracted data

After Extraction, You Can:

Analyze Data: Use extracted tables and content for data analysis
Generate New Documents: Create Word documents from extracted data using our Generate feature
Archive Content: Store extracted text, tables, and metadata in your organization's archive
Share Data: Export extracted data in various formats for team sharing

Learn More

Generate Guide

📄 Extract: Word Document Content Extraction