I am trying to write a program that will parse PDF files. From what I've read so far, PDF files are a combination of text data and binary data.
For example, within the PDF, there are "stream" objects which start with the word stream(newline), then contain the necessary binary data, and then end with the word endstream(newline)
Here is an example:
677 0 obj
<</Length 2821
/Filter /FlateDecode>>
stream
xœ…ZÛnG}÷WLô’, Oú~AÙ 6—AGbKœ5Éf†¦õ™ù£=M™dŸÖX›‡ÄÏô¥êTÕ©j¸yLœ“·¡5J»æfõæÇ_T#M+ðOssÿ懫÷ëë57ÿ}óï›7nÞH[c¼£o~Ø»©Oûñúª„FÝ-yùÆýöv“VP9Ù¯ª5çu*1!´R‡j±iجú48}k…¶ŒûeìvwC?•71B´2hÁÈ~WBpYeUµØôØÏåÑŒµ¦Zh¸/!Þ´ÿbȺßNis_Ê‹¯1²u'ÜÕɼmiÞ³óþp†ÞtŸRs×éúª™º~Õ¼»-.Þ\ͧ_¾[\E„VœWùã¾9¬»ù]³JÝØüÖ6«ïMãí¦{:-û¡ë§´MÝþËu³î6÷M¿›‡/ý]7§U»°¾‹²U—S’o3§œ”A›4ÏäZi=h"
Ãëþn]E¨V™(U‘IùÖŽ0wÃ~d2)£¡w;ö»‡©äœÒv±:ÿ<”ìϨÍk§ÒÎã'_üaøœÆÝPËhÑ*ªã3áLh«OÞ•q?Œó˜¦òvÙ@ÇX½ŸXë@ó"6·iî·ijº]3Œ«´L_¯^˜þ;ø¶yjºÍ¡{š–ñFtÿyhvÃü¼üÔôÓ´OÍý8l˜¶Ùö»~‡¦w‹!n.Œÿ;!Fö»Õ°¿ƒ7O?5·ûùŸÅï„/˜ü~Wf2©Tk£Ê¡D<m”Ùò%ˆè"½k~® ë4¥n…õ–Qç[‘m«•1Œ\•[ªèZc£bŒ@DW4¦²âf—÷ëvUzU^ÅW(1pC1&/ÔÍû‘k2åK$QÞ
äØ(Íkžˆj£¸õÂy‘!Ö낚¦qÝ=–V…)àHïø4ìÉ“fµŠ1]NFßjªuÆþa=—©Ti‹_fØï ù·Ü j¯ÉW[ƒ¤Dèïg
\m\®–Õ’Xð©ùØJ!"ƒæy!Ny †Â0ð°NÄPÛi_3š¶…áÌkF;»ÒºK>˜úy†7©¶À™ZhÆÝ&$ºTÅ$2‹aXGŽeœ°ŒxR2µQ8Ï ¹ƒà)}£ŒUÃ~üÅðšÓœÞvoóJgðÉ™úSÖl‡~.K¢ÖÁã·/§Q.¬s‘÷(‡»Õ?ï×ßQÌzœ]W†ÿXÑÀj©j“—nÁZtå°€ê\mºrÊc7¾]âꂸd¶‰VhíeÐIUýT"] 2F~$À˜9OtzÊT©dÈ2Ä1¬'ʸ51VÊÛÈ0
¹ìÛÇ<mQç«}kJíA·^Cé&íº¢––UºÚŒ#‹’c=c¨¼ÁÐ
ªÚmÚ?>¦‘ôwÖÁ™j)ª§·SVTÛÍ\OMÈB>ÊoÉ„LïXm5=¦îÓ“ 犲ÜtÍý>—£,žWý®Û,ê‹Ð?—”«ÿ¤¹JBÀ8í©–h4KÈ1yìHŠIga²œƒJPˆTMBîÐ"ƒ²Í(ÁÙB Z*Ë}jÌ4ì–£¡ÒŸRÎf¡8 ËîÌÔÐ*zÆŒ Öå¶+÷"¾²) $ƒZ(¤ª,Z¥(\Ř¯-cyt+Á²¶ ýNÜ0!^
Ò©«\äƒñªÐ©¥ÍæBxuì«Ð1MÍéëÉ*-ꃴwQ¨¿¦ãg«´ÝcZ-vqFÇB›â‹¾8ŒÃîá:+âÝò6Puæþ:²¢ ô ‰®¤¬ÜÂA`ZgRW4¿R=>{™?áJvº-Q×å‡ó”ŠÐ,ê“0SJe²×(e‰šAµD²2<ƒnÓ]·'1n Y<Šã`tÖ±¨üèP-£hC“S¬AÕã£çnjÍÚæLJa/¤L_汎‰ŠTœD4(MhwHŸÓdÞ¹„`W*Iˆ(BÁ m—›uR«?G(vÂ}¤ˆoTu$Ú-d%XŸº¨¹Öæø%Ðã0M=„å.êr9Â’qÕ©"!óP—r zHëcoC˜‰„œ#„÷ÕÁO‰aÁÓ…ìœT¸ÈK´?ZC2†CNÇPºÂäüP¢‚D`¢s 3á2yVp”¢zÉæ1A«¾&‡ú°ŒADSÉ*V¨W a(QÇnt´(^½r=#sÛQ[ ê]uÓðliÀ_\>¾²Yî[…Õ•-™(gÿgœd¹ÍßÁÜ
Dʺß.–!0¼T„¿òñÞ5ó·é¹24ÏÍй½kr@4ȕ۩»œÁPxÆÕòòº˜g^ý Ø@<‘‚7†q[j†¤ËÁ”‰Ë˜1mÊ W(Ðp‚a]¹›½ÁLÅõãa¿éÆ~~j)k {mý‰‰óë<ª!L)-“¬-1¯‡=ºwÊ*ÐJžÏÆYÄÌš†
ñDIEäddÈj?(g9ç+3°|Šh=¬¯ýÒ}Zê°µ*fÑy–w;¥ñs7÷î9N_Á§ØÒôÛÇaœ»Ý¼LQΩÿ ¦RÊÜ¡¢ÈèñåÄÅç–IjÆUè_´(‘1œ”mÔÎ2æ8q¡'Œ,€µ·Õá§¹Ûf~®êì¤cuƒ‹TnÉÑÐ!T·¨†âÆ"•! 1(Ϫ¾Ÿ*Oc
ÃŽóÐWæW”åN3{ Ý#¦î·¤þX«4qœÍ%dkŒh} Ã…¨Fá(.&r0äyÐK1•¼¯®V•tåp꜅ùD¨UŸy¶¢PÕÀB‚Uó«ã$_[kÇ“|ü…~ÅÞ:äÒªëq#ód\Ô>yÙHç¤cåê³€×¶òȘР—Úü"¦_x…my¦’燤€Ïýfqb®lñ,põ±ÚfÕ‚^.0¦
h¸XgyG˜uGƒw¨[pN0&¿T¥²$)ˆ˜‡-BJ1d»ç'&„¦³º:4±W9„xN°áW¨˜cIVg~Ù)Z…ÜWW‡>"þY§-Oç"®øâüTð‹‹²:}Õùçy»ñõùéliø¦tñˆóõU07»ý=„Éôü<±Âÿ|:—‰»´X'rn¾$E½ S##CèC*(\¡=C¨ÌJÔP‹ÁŠåÒç¬b&hÆFl;Ö,yêcXÚå_°ÎU—‚¶Û·Äô°1ãhˆî:U[ð¶ŸÖÃ# 4˜6ÆÕ½úo¹…ö_˜\æYëð‡Ü±wÛªˆ…£Ô„ÔlUí0˜ˆ2\DÈ-qå²´[z;W¢xØK_îöSž¿/Ž>d(ßò~&&A¢÷7ŒaßIÅ
aò,8Ñù3W|Îä„[ž£)Èa¡Á<ÂÖ
=ŒTÊ2ˆt)Œàp6F¬;.xÇq‘¯ìph§½Çœ®ÄÇC˜¯£6Ö+(E&fy7t•Ã%ú¸P‡‰6ÿX;èã!ŽO7çE¶MwNÌÃ6r4Ü&œÙ¨És¯§CžØ,&&iËçÀ¿‡}s@l=ÖëžÔî›™âÆ¡ì$/Gì‚“Ñ=1¨NTδ¨Õfµ"†F™¦´Û°ßÐÛ•Îi!û¹Dñ1ª,:gñb³±â¦·¼zîæÑ¤@2Š
UÎSZ™ÈçG©Ÿˆ|Ø0@Ìó©ÿmŸ<0(ŒâÁ•ÈIJrÊ&u¿6ä.-V^yñ x枢‡Ã¯QA³"ByÅ<&hÑ–˜çÄ
ã¨KbuþôÁÿ ŒÃï
endstream
(Note: There is also some header information before the "stream" word, telling you how long the binary data is, etc.)
What I have been doing is using a StreamReader to read the text data, and then I want to switch over and just read the binary data. Something like this:
using (FileStream fs = new FileStream(myFile, FileMode.Open))
{
fs.Seek(250205, SeekOrigin.Begin); // This jumps to the start of the data above
using (StreamReader sr = new StreamReader(fs))
{
String line = "";
while ((line = sr.ReadLine()) != "stream"); // This skips to the end of "stream"
// I want this to read the 2821 bytes AFTER stream
byte[] buffer = new byte[2821];
fs.Read(buffer, 0, buffer.Length);
}
}
When I run this, buffer does not contain the 2821 bytes after "stream". Instead, it contains data that is 1024 bytes AFTER the original Seek() position.
In other words: I Seeked to position 250205. Then, I read a few lines (a total of 57 bytes). But my buffer starts at position 251229 (inestead of 250262).
How can I get the FileStream to start reading where the StreamReader left off?
(My main reason for doing it this way is because FileStream does not have a ReadLine() method and StreamReader does not have a byte[] = Read() method.)