There are at least two issues in your implementation:
- Your translation from curl -X PUT to TIdHTTP is wrong.
- You don't specify the Accept HTTP header to retrieve the extracted text in a specific format.
How to translate curl -X PUT to Indy?
First, let's make it clear that curl -X PUT --data-binary @<filename> <url> is the same as curl -T <filename> <url> when:
- <url>'s scheme is HTTP or HTTPS
- <url> does not end with /
Therefore, using one or the other shouldn't matter in your case. See also the curl documentation.
Secondly, TIdMultiPartFormDataStream is designed for use with the POST verb. Nothing stops you from passing it to TIdHTTP.Put, because it is indirectly derived from TStream, but that is not its intended use. There is even a dedicated overload of the TIdHTTP.Post method that accepts a TIdMultiPartFormDataStream:
function Post(AURL: string; ASource: TIdMultiPartFormDataStream): string; overload;
To upload a file to the service, just use the TIdHTTP.Put method with a TFileStream as the argument, and provide the proper content type of the uploaded file in the Content-Type HTTP header.
Finally, you're trying to extract plain text from the document, but you didn't specify the content type that the service should return. This is done via the Accept HTTP header. A default instance of TIdHTTP has its Request.Accept property initialized to 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' (this may vary depending on the Indy version). Therefore, by default Tika returns HTML-formatted text. To get plain text, change it to 'text/plain; charset=utf-8'.
Fixed implementation:
uses IdGlobal, IdHTTP;

function GetDocumentText(const FileName, ContentType: string): string;
var
  IdHTTP: TIdHTTP;
  Stream: TIdReadFileExclusiveStream;
begin
  IdHTTP := TIdHTTP.Create;
  try
    IdHTTP.Request.Accept := 'text/plain; charset=utf-8';
    IdHTTP.Request.ContentType := ContentType;
    Stream := TIdReadFileExclusiveStream.Create(FileName);
    try
      Result := IdHTTP.Put('http://localhost:9998/tika', Stream);
    finally
      Stream.Free;
    end;
  finally
    IdHTTP.Free;
  end;
end;

function GetPDFText(const FileName: string): string;
const
  PDFContentType = 'application/pdf';
begin
  Result := GetDocumentText(FileName, PDFContentType);
end;

function GetDOCXText(const FileName: string): string;
const
  DOCXContentType = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document';
begin
  Result := GetDocumentText(FileName, DOCXContentType);
end;
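For completeness, here is a minimal sketch of how these helpers might be called from a console program. It assumes the functions above live in a unit named TikaClient (a hypothetical name), that a Tika server is listening on localhost:9998, and that the file path is a placeholder. The except block is only there so that an HTTP error status from Tika is reported instead of terminating the program.

```pascal
program TikaDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils,
  IdHTTP,
  TikaClient; // hypothetical unit containing GetPDFText from above

begin
  try
    // Placeholder path; replace it with a real document.
    Writeln(GetPDFText('C:\Docs\sample.pdf'));
  except
    // TIdHTTP raises this when the server answers with an HTTP error status.
    on E: EIdHTTPProtocolException do
      Writeln('HTTP ', E.ErrorCode, ': ', E.ErrorMessage);
    // Anything else, e.g. connection refused or file not found.
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.
```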
According to Tika's documentation, it also supports posting multipart form data. If you insist on using this approach, change the target resource to /tika/form and switch to the Post method in your implementation:
uses IdHTTP, IdMultipartFormData;

function GetDocumentText(const FileName, ContentType: string): string;
var
  IdHTTP: TIdHTTP;
  FormData: TIdMultiPartFormDataStream;
begin
  IdHTTP := TIdHTTP.Create;
  try
    IdHTTP.Request.Accept := 'text/plain; charset=utf-8';
    FormData := TIdMultiPartFormDataStream.Create;
    try
      FormData.AddFile('file', FileName, ContentType); { older Indy versions: FormData.Add(...) }
      { Post sets the multipart/form-data content type of the request automatically }
      Result := IdHTTP.Post('http://localhost:9998/tika/form', FormData);
    finally
      FormData.Free;
    end;
  finally
    IdHTTP.Free;
  end;
end;
Why does the original implementation in question work with PDF files?
When you Post multipart form data via TIdHTTP, Indy automatically sets the content type of the request to 'multipart/form-data; boundary=...whatever...'. This is not the case when you Put data: unless you set it manually before performing the request, TIdHTTP.Request.ContentType remains blank. I can only guess that when Tika sees an empty content type, it falls back to some default type, which could be PDF, and that it is still somehow able to read the document from the multipart request.