Rares Antalut, in partnership with Ramon-Gabriel Constructions


I've updated the "AWS Textract Document Processor" Canvas to include outputs in XML and HTML formats, utilizing only the data from the Textract JSON output. Here's the modified script:

Changes Made:

XML Generation: A new function, extract_and_save_xml, has been added. It takes the Textract JSON output and constructs an XML tree, including document metadata, blocks, text, geometry (bounding boxes), and relationships. It uses Python's built-in xml.etree.ElementTree module for XML creation. The XML output for each document will be saved as _textract.xml in the textract_results directory.

HTML Generation: A new function, extract_and_save_html, has been added. This function generates a basic HTML file that visually represents the extracted text blocks (lines and words) using their bounding box coordinates for positioning. This provides a visual overlay of the Textract output on a simulated document layout. It's a simplified representation, primarily focusing on LINE blocks and their corresponding text and position. The HTML output for each document will be saved as _textract.html in the textract_results directory.

Main Logic Update: The if __name__ == "__main__": block now calls both extract_and_save_xml and extract_and_save_html after successfully retrieving and saving the JSON, raw text CSV, and layout CSV files.

These additions expand the script's output capabilities, providing you with more ways to consume and visualize your Textract results directly from the generated files.
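The script itself is not reproduced here, so the following is only a minimal sketch of how the two conversions described above could look, not the actual extract_and_save_xml or extract_and_save_html. It assumes the standard Textract response shape (DocumentMetadata, Blocks, Geometry.BoundingBox with Left/Top/Width/Height as page ratios, Relationships with Type and Ids). The names textract_to_xml, textract_to_html, save_outputs, the stem parameter, the XML element names, and the 800x1100 pixel page size are all hypothetical choices made for illustration.

```python
import html
import xml.etree.ElementTree as ET
from pathlib import Path


def textract_to_xml(response: dict) -> ET.Element:
    """Build an XML tree from a Textract response dict (hypothetical helper)."""
    root = ET.Element("TextractDocument")  # element names are assumptions
    pages = response.get("DocumentMetadata", {}).get("Pages")
    if pages is not None:
        ET.SubElement(root, "DocumentMetadata", Pages=str(pages))
    blocks_el = ET.SubElement(root, "Blocks")
    for block in response.get("Blocks", []):
        block_el = ET.SubElement(
            blocks_el, "Block",
            Id=str(block.get("Id", "")),
            BlockType=str(block.get("BlockType", "")),
        )
        if "Text" in block:
            ET.SubElement(block_el, "Text").text = block["Text"]
        bbox = block.get("Geometry", {}).get("BoundingBox")
        if bbox:
            # Store Left/Top/Width/Height as string attributes.
            ET.SubElement(block_el, "BoundingBox", {k: str(v) for k, v in bbox.items()})
        for rel in block.get("Relationships") or []:
            rel_el = ET.SubElement(block_el, "Relationship", Type=str(rel.get("Type", "")))
            for child_id in rel.get("Ids", []):
                ET.SubElement(rel_el, "Id").text = child_id
    return root


def textract_to_html(response: dict) -> str:
    """Render LINE blocks as absolutely positioned divs, one container per page."""
    pages: dict[int, list[str]] = {}
    for block in response.get("Blocks", []):
        bbox = block.get("Geometry", {}).get("BoundingBox")
        if block.get("BlockType") != "LINE" or not bbox:
            continue
        # Bounding box values are ratios of the page size, so use percentages.
        style = (
            f"left:{bbox['Left'] * 100:.2f}%;top:{bbox['Top'] * 100:.2f}%;"
            f"width:{bbox['Width'] * 100:.2f}%;height:{bbox['Height'] * 100:.2f}%"
        )
        text = html.escape(block.get("Text", ""))
        pages.setdefault(block.get("Page", 1), []).append(
            f'<div class="line" style="{style}">{text}</div>'
        )
    body = "\n".join(
        f'<div class="page">\n' + "\n".join(lines) + "\n</div>"
        for _, lines in sorted(pages.items())
    )
    css = (
        ".page{position:relative;width:800px;height:1100px;"
        "border:1px solid #999;margin:1em auto}"
        ".line{position:absolute;font-size:12px;white-space:nowrap}"
    )
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        f"<style>{css}</style></head>\n<body>\n{body}\n</body></html>"
    )


def save_outputs(response: dict, stem: str, out_dir: str = "textract_results") -> None:
    """Write both files; the `stem` file-name prefix is an assumption."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    ET.ElementTree(textract_to_xml(response)).write(
        str(out / f"{stem}_textract.xml"), encoding="utf-8", xml_declaration=True
    )
    (out / f"{stem}_textract.html").write_text(textract_to_html(response), encoding="utf-8")


# Tiny hand-written response used only to exercise the functions.
sample_response = {
    "DocumentMetadata": {"Pages": 1},
    "Blocks": [
        {"BlockType": "LINE", "Id": "line-1", "Page": 1, "Text": "Hello <world>",
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.2, "Width": 0.3, "Height": 0.05}},
         "Relationships": [{"Type": "CHILD", "Ids": ["word-1"]}]},
        {"BlockType": "WORD", "Id": "word-1", "Page": 1, "Text": "Hello",
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.2, "Width": 0.1, "Height": 0.05}}},
    ],
}
```

Calling save_outputs(sample_response, "example") would write example_textract.xml and example_textract.html under textract_results. Keeping the tree and string builders separate from the file writing makes them easy to check without touching disk. Text is passed through html.escape so characters such as < in the recognized text do not break the page.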


Before and After