Add page-level text extraction for PDF/PPTX/DOCX documents by jeonsworld · Pull Request #1263 · microsoft/markitdown

jeonsworld · 2025-05-23T07:29:21Z

Summary

Adds optional page extraction to the PDF, PPTX, and DOCX converters through an extract_pages parameter. When it is enabled, the result also carries structured per-page data; existing calls behave as before, so the change is fully backward compatible.

Motivation

Users need to process PDF/PPTX/DOCX pages separately and know which content comes from which page for page-aware applications. Additionally, local development settings should not be tracked in version control.

Changes

New PageInfo class: Stores page number and content
Enhanced DocumentConverterResult: Added optional pages attribute
Extended converters: Added extract_pages parameter for page-by-page processing in PDF, PPTX, and DOCX converters
CLI support: Added --extract-pages and --pages-json flags
Comprehensive tests: Test cases covering all scenarios for each format
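
The two data-structure changes listed above could look roughly like the sketch below. It is not the PR's actual code: the classes are hypothetical stand-ins, the field names page_number and content are taken from the usage example in this description, and the markdown field and the dataclass layout are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageInfo:
    """Hypothetical stand-in: one page's number and its extracted text."""

    page_number: int
    content: str


@dataclass
class DocumentConverterResult:
    """Hypothetical stand-in for the result type; pages stays None unless requested."""

    markdown: str
    pages: Optional[List[PageInfo]] = None


# Without page extraction the result carries only the full text, as before.
plain = DocumentConverterResult(markdown="Intro\n\nDetails")

# With page extraction the same text is also available page by page.
paged = DocumentConverterResult(
    markdown="Intro\n\nDetails",
    pages=[PageInfo(1, "Intro"), PageInfo(2, "Details")],
)
```

Defaulting pages to None is one way to keep existing callers unaffected: code that never reads the attribute sees no difference.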

Usage

Python API

from markitdown import MarkItDown

md = MarkItDown()

# Traditional (unchanged)
result = md.convert("doc.pdf")

# New page extraction - works for PDF, PPTX, and DOCX
result = md.convert("doc.pdf", extract_pages=True)
result = md.convert("presentation.pptx", extract_pages=True)
result = md.convert("document.docx", extract_pages=True)

# pages can be None (the commit notes say DOCX returns None), so guard the loop
for page in result.pages or []:
    print(f"Page {page.page_number}: {page.content}")
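
As an illustration of the page-aware use case from the Motivation section, the sketch below finds which pages mention a term. The helper pages_mentioning is hypothetical and not part of the PR, and the page objects are simple stand-ins with the page_number and content attributes shown above, so it runs without markitdown installed.

```python
from types import SimpleNamespace


def pages_mentioning(pages, term):
    """Return the page numbers whose content contains term (case-insensitive)."""
    if not pages:  # covers None (no page data) and an empty list
        return []
    needle = term.lower()
    return [p.page_number for p in pages if needle in p.content.lower()]


# Stand-ins for result.pages; real objects would come from
# md.convert(..., extract_pages=True) once the feature is available.
pages = [
    SimpleNamespace(page_number=1, content="Quarterly revenue overview"),
    SimpleNamespace(page_number=2, content="Hiring plan"),
    SimpleNamespace(page_number=3, content="Revenue by region"),
]

print(pages_mentioning(pages, "revenue"))
```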

CLI

# Extract pages with JSON output
markitdown doc.pdf --extract-pages --pages-json
markitdown presentation.pptx --extract-pages --pages-json
markitdown document.docx --extract-pages --pages-json

Resolved #210 #122

jeonsworld · 2025-05-23T07:30:53Z

@microsoft-github-policy-service agree

afourney · 2025-05-23T20:29:50Z

I like this idea. It meshes well with the pptx slide output as well.

I need to do a little testing before merging -- I'll try to do that this weekend.

mcchoe · 2025-06-12T03:31:20Z

Hi team - any ETA on the release of this PR? This would greatly help our project.

kanemaru-nec · 2025-06-12T08:50:32Z

@jeonsworld It seems that some status checks are still pending. We need this feature for our project, so please move it forward.

jeonsworld · 2025-06-12T12:19:31Z

@afourney Hi, the workflows for this PR are currently pending approval. Could you please review and approve them so the checks can run? Thank you.

gaccastro · 2025-07-03T08:05:26Z

Hello everyone and @afourney,

Apologies for the tagging but I was wondering if there is an ETA on this? It's something that would be very useful overall and also for a particular project my team is working on.

hkaraoguz · 2025-07-05T11:09:17Z

This feature will be very useful so I am also wondering when this can be approved. Thank you.

ttc-christopher-simmerman · 2025-07-14T17:08:33Z

Has this been implemented yet?

nuldertien · 2025-07-15T14:48:08Z

Would be helpful to me as well!

Abhiraj-Alois · 2025-07-30T07:55:45Z

Hi! First of all, thank you so much for developing this feature. Could you please let me know when this version will be released? It would be incredibly helpful for my project!

dj953590 · 2025-08-03T00:08:49Z

This is such an important feature. Can we get a new build (0.7.2) with the extract_pages feature added to it?

semor-joe · 2025-09-18T15:56:19Z

Hi, I want to ask whether this will be merged into the main branch. This is a really important feature.

elieworkspace · 2025-12-06T20:03:04Z

Hello, could you please move it to the main branch? This feature is so important!! Thank you for this wonderful project.

zigarc · 2025-12-08T15:46:30Z

Please merge this pull request and release it. It is very important for our project.

jswaczyna · 2026-02-11T13:40:45Z

What is the status of this task??? When is this going to be added?

- Add PageInfo class to store page number and content
- Enhance DocumentConverterResult with optional pages attribute
- Extend PdfConverter with extract_pages parameter for page-by-page processing
- Add CLI support with --extract-pages and --pages-json flags
- Implement robust error handling with fallback to full document extraction
- Maintain 100% backward compatibility with existing API
- Add comprehensive test suite with 8 test cases covering all scenarios
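
The fallback to full document extraction mentioned in this commit could follow a pattern like the one below. This is a minimal sketch under stated assumptions, not the PR's implementation: extract_with_fallback and both callables are hypothetical, and catching every Exception is a simplification.

```python
from typing import Callable, List, Optional, Tuple


def extract_with_fallback(
    per_page: Callable[[], List[str]],
    full_document: Callable[[], str],
) -> Tuple[str, Optional[List[str]]]:
    """Try per-page extraction; on any failure return whole-document text and no pages."""
    try:
        page_texts = per_page()
    except Exception:
        # Page-level extraction failed, so degrade to the original behaviour.
        return full_document(), None
    return "\n\n".join(page_texts), page_texts


def broken_per_page() -> List[str]:
    """Hypothetical extractor that simulates a page-level failure."""
    raise ValueError("page tree is damaged")


ok_text, ok_pages = extract_with_fallback(lambda: ["first", "second"], lambda: "unused")
fb_text, fb_pages = extract_with_fallback(broken_per_page, lambda: "whole document text")
```

Returning None for the pages in the fallback case lets callers tell "no page data" apart from "a document with zero pages".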

- Add slide-level extraction for PPTX files with extract_pages parameter
- Each slide is treated as a PageInfo object with sequential numbering
- Add extract_pages parameter to DOCX for API consistency (returns None due to dynamic pagination)
- Import PageInfo class in both converters to support the new functionality
- Add comprehensive test suites for both formats ensuring backward compatibility
- Maintain 100% backward compatibility with existing API
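
The per-format behaviour described in this commit can be sketched as follows. The function pages_for is hypothetical, pages are shown as plain (number, text) tuples rather than PageInfo objects, and numbering from 1 is an assumption; the commit only says the numbering is sequential.

```python
from typing import List, Optional, Tuple


def pages_for(extension: str, unit_texts: List[str]) -> Optional[List[Tuple[int, str]]]:
    """Number the extracted units in order, or return None when pages are not defined."""
    if extension.lower() == ".docx":
        # DOCX pagination depends on layout at render time, so there are
        # no fixed page boundaries to report.
        return None
    # Each PPTX slide (or PDF page) becomes one numbered entry.
    return list(enumerate(unit_texts, start=1))


slide_pages = pages_for(".pptx", ["Title slide", "Agenda", "Summary"])
docx_pages = pages_for(".docx", ["Body text"])
```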

- Format all Python files with Black (v23.7.0)
- Fix line length and formatting issues in page extraction feature files
- Ensure consistent code style across the codebase

jeonsworld · 2026-02-13T11:06:01Z

@afourney @zashed Could you approve the CI workflows? Thank you!

jeonsworld changed the title from "Add page-level text extraction for PDF documents" to "Add page-level text extraction for PDF/PPTX/DOCX documents" May 23, 2025

zashed approved these changes Dec 7, 2025

jeonsworld and others added 4 commits February 13, 2026 19:59

style: apply Black formatting to fix pre-commit checks

413e60b

Fixed formatting.

600a983

jeonsworld force-pushed the main branch from 900233d to 600a983 February 13, 2026 11:00

15 participants
