Enhance table processing flow #626

Open
shwetha-s-poojary wants to merge 8 commits into IBM:main from shwetha-s-poojary:enhance_table_processing
Conversation

@shwetha-s-poojary
Member

@shwetha-s-poojary shwetha-s-poojary commented Apr 15, 2026

As part of this PR:

  • Extended the chunking pipeline to include table summaries alongside text summaries (previously only text summaries were chunked).

    • Introduced an additional intermediate file, *_table_chunk.json, to process chunks.
    • Renamed create_chunk_document to merge_chunked_documents.
  • Improved the ingestion pipeline by exporting tables as Markdown, removing redundant caption handling, and adding minor cleanup and clarifying comments.

  • Consolidated and simplified the LLM utilities by introducing a single unified summary + classification prompt.

    • Increased max_tokens to 1024.
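
For readers following along, the merge step described above can be pictured roughly as follows. This is a minimal sketch and not the PR's implementation of merge_chunked_documents: the function name `merge_chunks_sketch`, the `chunk_type` field, and the shape of each chunk dictionary are assumptions made for illustration only.

```python
from typing import Any


def merge_chunks_sketch(
    text_chunks: list[dict[str, Any]],
    table_chunks: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Combine text and table chunks into one list (hypothetical helper).

    Each chunk is copied and tagged with an assumed "chunk_type" field so
    that later stages can tell the two sources apart.
    """
    merged: list[dict[str, Any]] = []
    for kind, chunks in (("text", text_chunks), ("table", table_chunks)):
        for chunk in chunks:
            # Copy the chunk so the caller's dictionaries are not mutated.
            merged.append({**chunk, "chunk_type": kind})
    return merged
```

Text chunks come first and table chunks second, so the original order within each source is preserved.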

Member

@manalilatkar manalilatkar left a comment


Overall LGTM.

Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com>
@shwetha-s-poojary force-pushed the enhance_table_processing branch from 31f9661 to 70bed40 on April 16, 2026 11:45
Comment thread spyre-rag/src/digitize/doc_utils.py Outdated
"caption": caption,
"markdown_source": markdown_source,
"page_number": page_number,
"table_index": tab_idx,
Member

This is not used.

Member Author

table_index isn't used right now; I will remove it. The other three fields are used later.

Comment thread spyre-rag/src/digitize/doc_utils.py
Comment thread spyre-rag/src/digitize/doc_utils.py Outdated
Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com>
"query_vllm_stream_de": "Sie erhalten: 1. **Einen kurzen Kontexttext** mit sachlichen Informationen.\n2. **Die Frage eines Nutzers**, der um Klärung oder Rat bittet.\n3. **Geben Sie eine prägnante und aussagekräftige Antwort, die sich strikt auf den gegebenen Kontext stützt.**\n\nDie Antwort sollte korrekt, leicht verständlich und kontextbezogen sein sowie eine klare Begründung enthalten.\nWenn der Kontext nicht genügend Informationen liefert, antworten Sie mit Ihrem Allgemeinwissen.\n\nKontext:{context}\n\nFrage:{question}\n\nAntwort:",
"llm_classify": "You are an intelligent assistant helping to curate a knowledge base for a Retrieval-Augmented Generation (RAG) system.\nYour task is to decide whether the following text should be included in the knowledge corpus. Respond only with \"yes\" or \"no\".\n\nRespond \"yes\" if the text contains factual, instructional, or explanatory information that could help answer general user questions on any topic.\nRespond \"no\" if the text contains personal, administrative, or irrelevant content, such as names, acknowledgements, contact info, disclaimers, legal notices, or unrelated commentary.\n\nText: {text}\n\nAnswer:",
"table_summary": "You are an intelligent assistant analyzing set of documents.\nYou are given a table extracted from a document. Your task is to summarize the key points and insights from the table. Avoid repeating the entire content; focus on what is meaningful or important.\n\nTable:\n{content}\n\nSummary:",
"table_summary_and_classify": "You are an intelligent assistant analyzing tables extracted from documents.\n\nYour tasks:\n\n1. Extract and document EVERY piece of information from the table in extensive detail:\n - List ALL sections, subsections, and their reference numbers if present\n - Include EVERY specification, feature, number, code, condition, and requirement\n - Mention ALL items even if they seem minor - nothing should be omitted\n - Use structured format with clear organization (numbered lists, bullet points, or detailed paragraphs)\n - Be extremely thorough and comprehensive - aim for maximum detail\n - If the table has multiple rows/columns, describe each one\n - Preserve all technical terms, version numbers, and specific details exactly as shown\n\n2. Decide if the table is relevant for a knowledge base:\n - Relevant: contains factual, instructional, or explanatory info useful for answering questions.\n - Irrelevant: personal info, disclaimers, administrative notes, or unrelated commentary.\n\n3. Output in the exact format below:\n\nSummary: <your extremely detailed summary here - be verbose and comprehensive>\nDecision: <yes or no>\n\nDo NOT output JSON, extra commentary, or any other text.\n\nExamples:\n\nPositive example (relevant):\nTable:\n| Processor | Cores | Memory |\n|-----------|-------|--------|\n| Power10 | 16 | 8 TB |\n\nOutput:\nSummary: The table presents technical specifications for the Power10 processor. The processor configuration includes 16 cores for parallel processing capabilities. The memory capacity supports up to 8 TB (terabytes) of RAM, providing substantial memory resources for enterprise workloads and data-intensive applications.\nDecision: yes\n\nNegative example (irrelevant):\nTable:\n| Prepared by: | John Smith |\n|--------------|------------|\n\nOutput:\nSummary: Document metadata indicating it was prepared by John Smith.\nDecision: no\n\nNow analyze the table below:\n\nTable:\n{content}",
Member Author

New prompt:

You are an intelligent assistant analyzing tables extracted from documents.

Your tasks:

1. Extract and document EVERY piece of information from the table in extensive detail:
   - List ALL sections, subsections, and their reference numbers if present
   - Include EVERY specification, feature, number, code, condition, and requirement
   - Mention ALL items even if they seem minor - nothing should be omitted
   - Use structured format with clear organization (numbered lists, bullet points, or detailed paragraphs)
   - Be extremely thorough and comprehensive - aim for maximum detail
   - If the table has multiple rows/columns, describe each one
   - Preserve all technical terms, version numbers, and specific details exactly as shown

2. Decide if the table is relevant for a knowledge base:
   - Relevant: contains factual, instructional, or explanatory info useful for answering questions.
   - Irrelevant: personal info, disclaimers, administrative notes, or unrelated commentary.

3. Output in the exact format below:

Summary: <your extremely detailed summary here - be verbose and comprehensive>
Decision: <yes or no>

Do NOT output JSON, extra commentary, or any other text.

Examples:

Positive example (relevant):
Table:
| Processor | Cores | Memory |
|-----------|-------|--------|
| Power10   | 16    | 8 TB   |

Output:
Summary: The table presents technical specifications for the Power10 processor. The processor configuration includes 16 cores for parallel processing capabilities. The memory capacity supports up to 8 TB (terabytes) of RAM, providing substantial memory resources for enterprise workloads and data-intensive applications.
Decision: yes

Negative example (irrelevant):
Table:
| Prepared by: | John Smith |
|--------------|------------|

Output:
Summary: Document metadata indicating it was prepared by John Smith.
Decision: no

Now analyze the table below:

Table:
{content}
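
Because the prompt asks for a fixed "Summary: ... / Decision: ..." layout, the response can be split with a small parser. The following is only a sketch of one way to do that, assuming the model follows the requested format; the function name `parse_summary_and_decision` is hypothetical and is not taken from llm_utils.py.

```python
import re
from typing import Optional

# Matches "Summary: <text>" followed by "Decision: yes|no".
# DOTALL lets the summary span several lines.
_RESPONSE_PATTERN = re.compile(
    r"Summary:\s*(?P<summary>.*?)\s*Decision:\s*(?P<decision>yes|no)\b",
    re.IGNORECASE | re.DOTALL,
)


def parse_summary_and_decision(response: str) -> Optional[tuple[str, bool]]:
    """Return (summary, is_relevant), or None if the format is not followed."""
    match = _RESPONSE_PATTERN.search(response)
    if match is None:
        # The model ignored the requested layout; the caller decides the fallback.
        return None
    summary = match.group("summary").strip()
    is_relevant = match.group("decision").lower() == "yes"
    return summary, is_relevant
```

Returning None on a malformed response, rather than guessing a decision, leaves the fallback policy (retry, skip the table, or keep it) to the caller.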

Member

Increasing the prompt length may exhaust your max_tokens limit since you still need to include the source text. You also need to assess the impact of this change.

@mkumatag
Member

For better visibility: Increasing the prompt length may exhaust your max_tokens limit since you still need to include the source text. You also need to assess the impact of this change.
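
One rough way to assess the concern above is to estimate whether the filled-in prompt plus the generation budget still fits the model's context window. The sketch below rests on several assumptions that are not stated in this PR: that max_tokens caps generated tokens only, that the model has a fixed context window (8192 is a placeholder value), and that about 4 characters per token is an acceptable approximation. A real check should use the model's own tokenizer.

```python
import math


def fits_context_window(
    prompt_template: str,
    table_markdown: str,
    max_tokens: int = 1024,         # generation budget, as set in this PR
    context_window: int = 8192,     # hypothetical model limit
    chars_per_token: float = 4.0,   # rough heuristic, not a tokenizer
) -> bool:
    """Estimate whether prompt tokens plus max_tokens fit the context window."""
    # Substitute the table into the template the same way the prompt is filled.
    prompt = prompt_template.replace("{content}", table_markdown)
    estimated_prompt_tokens = math.ceil(len(prompt) / chars_per_token)
    return estimated_prompt_tokens + max_tokens <= context_window
```

Running this over the largest tables in a sample corpus would give a first indication of how often the longer prompt pushes a request past the limit.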
