drewnoakes/metadata-extractor
Add tags for MS Document Text, MS Property Storage
Open
#579 opened on Jun 2, 2022
Labels: format-tiff, help wanted, image-queue
Description
I recently ran exiftool on a batch of TIFFs from our regression corpus at Apache Tika. I was interested to see that text for the underlying document (OCR'd or original) can be stored in what exiftool calls "MS Document Text", which is currently an unknown tag with ID 0x932f. There's also "MS Property Set Storage" (0x9330).
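To check whether a given TIFF carries these tags without any library support, something like the following rough sketch could work. It is plain Python rather than metadata-extractor code, and the names `MS_TAGS`, `TYPE_SIZES` and `find_ms_tags` are made up for illustration. It assumes a classic (non-BigTIFF) file and only walks the top-level IFD chain, so tags stored in sub-IFDs would be missed. It returns the raw bytes and makes no assumption about how the text is encoded.

```python
import struct

# Tag IDs and names as reported by exiftool (see the description above).
MS_TAGS = {
    0x932F: "MS Document Text",
    0x9330: "MS Property Set Storage",
}

# Byte sizes of the TIFF 6.0 field types (1=BYTE ... 12=DOUBLE).
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1,
              7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}


def find_ms_tags(data: bytes) -> dict:
    """Return {tag_id: raw_bytes} for the MS tags found in the top-level IFD chain."""
    if data[:2] == b"II":
        endian = "<"  # little-endian (Intel)
    elif data[:2] == b"MM":
        endian = ">"  # big-endian (Motorola)
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")

    found = {}
    seen = set()  # guards against IFD chains that loop back on themselves
    while ifd_offset and ifd_offset not in seen:
        seen.add(ifd_offset)
        (count,) = struct.unpack_from(endian + "H", data, ifd_offset)
        for i in range(count):
            entry = ifd_offset + 2 + 12 * i
            tag, ftype, n = struct.unpack_from(endian + "HHI", data, entry)
            size = TYPE_SIZES.get(ftype, 1) * n
            if size <= 4:
                start = entry + 8  # value is stored inline in the entry
            else:
                (start,) = struct.unpack_from(endian + "I", data, entry + 8)
            if tag in MS_TAGS:
                found[tag] = data[start:start + size]
        # The offset of the next IFD follows the last 12-byte entry.
        (ifd_offset,) = struct.unpack_from(
            endian + "I", data, ifd_offset + 2 + 12 * count
        )
    return found
```

Calling `find_ms_tags(open(path, "rb").read())` on a file gives back a dict keyed by tag ID, which is empty when neither tag is present.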
An example file is here: https://corpora.tika.apache.org/base/docs/commoncrawl3/RD/RDAFESH5CBBJWWQZMZR4MGJIPYYEL7DN
This is what exiftool extracts from the file:
ExifTool Version Number : 12.42
File Name : RDAFESH5CBBJWWQZMZR4MGJIPYYEL7DN
Directory : /data1/docs/commoncrawl3/RD
File Size : 38 kB
File Modification Date/Time : 2018:11:05 02:38:44+01:00
File Access Date/Time : 2022:06:01 15:25:59+02:00
File Inode Change Date/Time : 2020:06:10 23:11:36+02:00
File Permissions : -rwxr-xr-x
File Type : TIFF
File Type Extension : tif
MIME Type : image/tiff
Exif Byte Order : Little-endian (Intel, II)
Image Width : 1760
Image Height : 2800
Bits Per Sample : 1
Compression : T6/Group 4 Fax
Photometric Interpretation : WhiteIsZero
Strip Offsets : 8
Samples Per Pixel : 1
Rows Per Strip : 2800
Strip Byte Counts : 23737
X Resolution : 200
Y Resolution : 200
Resolution Unit : inches
Software : HATFILT Version 1.8
Subfile Type : Reduced-resolution image
Preview Image Start : 27248
Preview Image Length : 5225
JPEG Proc : Baseline
Jpg From Raw Start : 27248
Jpg From Raw Length : 5225
MS Document Text : .d.CÂMARA. ..MUNICIPAL DE VARGEM ALTA. ..ESTADO DO ESPíRITO SANTO. .DECRETO LEGISLATIVO N° 032197. ..APROVA AS CONTAS. .MUNICIPAL DE VARGEM. .ESPíRITO SANTO,. .ExERCIdo DE 1996.. ..DA PREFEITURA. .ALTA, EST>
MS Property Set Storage : (Binary data 5632 bytes, use -b option to extract)
MS Document Text Position : (Binary data 2110 bytes, use -b option to extract)
Image Size : 1760x2800
Jpg From Raw : (Binary data 5225 bytes, use -b option to extract)
Megapixels : 4.9
Preview Image : (Binary data 5225 bytes, use -b option to extract)
The exiftool dumps of the tiffs are available as tiffs-*.gz here: https://corpora.tika.apache.org/base/share/
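To find which files in those dumps carry the tag, a small parser along these lines might help. This is only a sketch: it assumes each dump is a concatenation of exiftool text reports in the "Name : Value" layout shown above, and that every report starts with an "ExifTool Version Number" line. I have not verified the actual layout of the dumps, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def parse_exiftool_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one {tag name: value} dict per exiftool report.

    Assumption: a new report starts at each "ExifTool Version Number" line.
    """
    record: Dict[str, str] = {}
    for line in lines:
        # Tag names contain no colon, so the first colon is the separator;
        # values such as timestamps may contain more colons.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Name : Value" line
        name, value = name.strip(), value.strip()
        if name == "ExifTool Version Number" and record:
            yield record
            record = {}
        record[name] = value
    if record:
        yield record


def files_with_document_text(lines: Iterable[str]) -> List[str]:
    """Return the "File Name" of every report that has an MS Document Text tag."""
    return [
        rec.get("File Name", "?")
        for rec in parse_exiftool_records(lines)
        if "MS Document Text" in rec
    ]
```

The lines could come from a file handle opened with `gzip.open(path, "rt", errors="replace")`, where `path` points at one of the downloaded dumps.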