
An Overview of ByteDance’s Document Parsing Model, Dolphin

ByteDance is a Chinese technology company that has developed novel video-sharing social networking applications, most notably TikTok. They’ve also made impressive contributions to the AI industry, such as open-sourcing Monolith, a high-throughput, low-latency deep learning framework for large-scale recommendation modeling. There’s been a slew of recent releases including Bagel, an open-source multimodal foundation model with image generation and editing capabilities; Trae, an AI assistant designed for programmers that can answer coding queries, complete code snippets, and develop entire projects from prompts; DAPO, the distributed reinforcement-learning framework for LLM optimization; and UI-TARS, the open-source agent for automating GUI interactions.

Additionally, they introduced Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a new multimodal document image parsing model. We’ve been covering open-source document processing models in our SmolDocling (from IBM Research and Hugging Face) and olmOCR/rolmOCR (from AllenAI and Reducto) articles, and are therefore very excited to explore ByteDance’s contribution to the space.

There are two parts to this tutorial: (1) an overview covering the model architecture and training methodology, and (2) an implementation where we run the model. We’ll show you how you can run Dolphin on DigitalOcean GPU Droplets.

The topics presented in the overview section of this article require familiarity with transformers, the attention mechanism (self-attention and cross-attention), Vision Language Models (VLMs), and related concepts. The implementation section may require some familiarity with the command line.

Feel free to skip sections that aren’t of use to you.

The researchers’ motivation for developing Dolphin is the limitations of current integration-based document parsing solutions and Vision Language Model (VLM) solutions.

Existing document parsing methods mentioned in the Dolphin paper are depicted below. The repositories are linked for your exploration; where a repository could not be found, a blog post or paper is linked instead.

In the paper, the researchers explain that integration-based document parsing solutions require independent optimization of each OCR task (e.g., layout detection, reading-order prediction, and recognition of text lines, formulas, or tables), while existing VLM solutions (both general and expert VLMs) suffer from layout-structure degradation and efficiency bottlenecks when parsing lengthy documents with complicated layouts. To address these limitations, ByteDance proposes Dolphin for document processing.

Dolphin follows an “analyze-then-parse” approach to extract structured content from documents. The first stage, the analyze stage, analyzes the page-level layout to extract elements in reading order. The second stage, the parse stage, uses the extracted elements to parse individual elements in parallel. This parallel processing, when paired with prompts tailored to specific elements, allows for computational efficiency and accurate content identification.

Dolphin leverages an encoder-decoder transformer architecture. The encoder is a Swin Transformer that takes a page image as input and outputs a sequence of visual embeddings. Conditioned on the prompt “Parse the reading order of this document.”, the mBART decoder cross-attends to the encoded visual features to derive the sequential layout elements while preserving their structural relationships.
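To make the first stage concrete, here is a minimal sketch of how prompt-conditioned layout analysis could be invoked with Hugging Face transformers. It assumes the checkpoint loads as a Donut-style VisionEncoderDecoderModel and that the layout prompt is passed as the decoder’s input; the paths, prompt handling, and generation settings are illustrative assumptions, not the official demo code (the repository’s demo_page_hf.py is the reference implementation).

from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

# Assumption: the checkpoint loads as a Donut-style vision encoder-decoder
# (Swin encoder + mBART decoder); paths and prompt handling are illustrative.
processor = AutoProcessor.from_pretrained("./hf_model")
model = VisionEncoderDecoderModel.from_pretrained("./hf_model").eval()

page = Image.open("./demo/page_imgs/page_1.jpeg").convert("RGB")
pixel_values = processor(page, return_tensors="pt").pixel_values  # Swin encoder input

# Stage 1: the layout-analysis prompt is fed to the decoder, which
# cross-attends to the visual features and emits layout elements in reading order.
prompt_ids = processor.tokenizer(
    "Parse the reading order of this document.",
    add_special_tokens=False,
    return_tensors="pt",
).input_ids
layout_ids = model.generate(pixel_values=pixel_values, decoder_input_ids=prompt_ids, max_length=1024)
print(processor.batch_decode(layout_ids, skip_special_tokens=True)[0])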

The second stage uses layout elements to parse content in parallel, making it efficient while keeping element-specific details. This happens in two steps:

Element Image Encoding: Each layout element’s region is cropped from the original image to create a local view. These views are encoded using the Swin Transformer to produce element-specific visual features.

Parallel Content Parsing: With these encoded features, the decoder generates parsed content for each element in parallel, as sketched below.
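Continuing the sketch above, the second stage would crop each predicted element out of the page and run the same encoder-decoder over the crops with element-specific prompts. The placeholder elements list, the prompt wording, and the sequential loop are illustrative simplifications; Dolphin batches the crops so that many elements are decoded in one pass (controlled by the --max_batch_size flag shown in Step 9).

# Assumption: `elements` holds (box, element_type) pairs parsed from the stage-1
# output; the placeholder below stands in for that parser, which is not shown.
elements = [((100, 200, 800, 300), "text")]  # (x0, y0, x1, y1), element type

results = []
for box, element_type in elements:
    crop = page.crop(box)  # local view of one layout element
    crop_pixels = processor(crop, return_tensors="pt").pixel_values

    # Element-specific prompt (wording is illustrative, not the official template)
    prompt = "Parse the table." if element_type == "table" else "Read the text in the image."
    prompt_ids = processor.tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids

    out = model.generate(pixel_values=crop_pixels, decoder_input_ids=prompt_ids, max_length=2048)
    results.append(processor.batch_decode(out, skip_special_tokens=True)[0])

# In Dolphin itself, the crops are grouped into batches so all elements of a
# page are decoded in parallel rather than one at a time as in this loop.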

Dolphin was initialized with pretrained weights from Donut. The training dataset used for instruction tuning includes 30 million samples, covering both page-level documents and element-level components. The table below goes into detail about the types of data samples as well as how these different data formats were processed for either Dolphin’s layout or parsing stage.

Data Source | Processing/Rendering Method | Annotation/Tagging Details | Number of Samples | Task Type(s)
Mixed Documents | Collected from diverse sources (educational materials, publications, business documents). | Manually annotated with element-level boundaries and their reading order. | 0.12M | Layout
HTML | Synthetic training data generated through web rendering (e.g., Chinese and English Wikipedia articles). Random font selection applied for visual diversity. | HTML content processed by adding span tags for character-level annotation. Comprehensive bounding box annotations obtained at the character, word, line, and paragraph levels. | 4.37M | Parsing
LaTeX | Processed using LaTeX Rainbow, a specialized rendering framework that preserves hierarchical structure. Different elements (formulas, figures) rendered with distinct colors. XeTeX used for rendering formula images. | Rendered documents automatically parsed to extract element types, hierarchical relationships, and spatial locations at the block, line, and word levels. Formula expressions collected in LaTeX format. | 0.5M | Parsing
Markdown | Processed using Pandoc for PDF rendering with customized templates. | PyMuPDF-based parsing and content alignment with the source Markdown to obtain hierarchical text annotations at the paragraph, line, and word levels, as well as specific element types like tables. Formula blocks found via pixel matching after rendering in different colors. | 0.71M | Parsing
Tables | Utilized existing large-scale datasets: PubTabNet (568K tables) and PubTab1M (1M tables). | PubTabNet provides HTML annotations; PubTab1M provides more fine-grained structure annotations. | 1.57M | Parsing
Formulas | Formula expressions in LaTeX format collected from arXiv and rendered into formula images using the XeTeX tool. Various backgrounds and fonts used in rendering. | The LaTeX source itself serves as the ground truth. | 23M | Parsing

Dolphin offers two inference frameworks that support parsing documents at two different levels. The first is page-level parsing, where the entire document page is converted into structured JSON and Markdown output. The second is element-level parsing, which breaks the document down into individual components, such as text, tables, and formulas, for more detailed analysis.

Step 1: Set up a GPU Droplet

Begin by setting up a DigitalOcean GPU Droplet: select AI/ML and choose the NVIDIA H100 option.

Step 2: SSH

From your favorite code editor or terminal, SSH into the Droplet:

ssh root@<your-droplet-ip>

Step 3: Install Dependencies

In the terminal, run the following command:

apt install python3-pip python3.10
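You can quickly confirm that Python and pip are available before moving on:

python3 --version
pip3 --version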

Step 4: Install Conda

Download the Miniconda installer:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

Run the installer:

bash Miniconda3-latest-Linux-x86_64.sh
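Follow the installer’s prompts. Before moving on, reload your shell so the conda command is on your PATH; assuming you accepted the default install location of ~/miniconda3, you can run:

source ~/miniconda3/bin/activate
conda --version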

Now let’s set up the Dolphin project and download the necessary model.

Step 5: Create a Conda Environment and Clone Dolphin

conda create -n ai python=3.11 -y && conda activate ai
git clone https://github.com/ByteDance/Dolphin.git && cd Dolphin

These commands create a new Conda environment named ai with Python 3.11, activate it, then clone the Dolphin repository and navigate into its directory.

Step 6: Install Python Requirements

Next, install all the Python libraries Dolphin needs:

pip install -r requirements.txt huggingface_hub

This installs everything listed in Dolphin’s requirements.txt file, plus huggingface_hub for interacting with Hugging Face.

Step 7: Prepare for Model Download

We need a spot to save the model:

mkdir hf_model

Log in to Hugging Face:
You’ll need a Hugging Face access token to download the model. If you don’t have one, create it on the Hugging Face website under your profile settings (Settings -> Access Tokens).

Then, log in via the command line:

huggingface-cli login

Paste your token when prompted.
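If you’d like to confirm the login worked, you can run:

huggingface-cli whoami

This should print your Hugging Face username.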

Step 8: Download the Dolphin Model

Finally, download the model files directly into your new directory:

huggingface-cli download ByteDance/Dolphin --local-dir ./hf_model
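Once the download finishes, you can list the directory to confirm the model weights and configuration files are in place:

ls ./hf_model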

Step 9: Run Inference

Let’s run inference on the images provided in the Dolphin demo folder:

# Process a single document image
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs/page_1.jpeg --save_dir ./results

# Process a single document pdf
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs/page_6.pdf --save_dir ./results

# Process all documents in a directory
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs --save_dir ./results

# Process with custom batch size for parallel element decoding
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs --save_dir ./results --max_batch_size 16

Let’s take a look at the page_1 output.

Below is the Markdown output:

Below is the JSON output:

We’re pretty pleased with the model’s performance! Try both page-level and element-level parsing and let us know what you think in the comments below:)
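For reference, element-level parsing uses a separate demo script in the Dolphin repository. The invocation below is an assumption modeled on the page-level commands in Step 9 (the script name demo_element_hf.py, the --element_type flag, and the sample image path may differ in the actual repo), so check the repository’s README for the exact usage:

# Parse a single cropped table image (script name, flag, and path are assumptions)
python demo_element_hf.py --model_path ./hf_model --input_path ./demo/element_imgs/table_1.jpeg --element_type table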

In summary, ByteDance’s Dolphin model presents a promising approach to document parsing by utilizing an analyze-then-parse strategy. This method, leveraging Heterogeneous Anchor Prompting, allows for both accuracy and efficiency, addressing limitations found in existing integration-based and VLM solutions. We went over the model architecture and training process, and we showed you how you can run Dolphin’s page-level and element-level parsing options on DigitalOcean GPU Droplets.

Happy experimenting!
