18

I am working on a project with an Internet connection through satellite, with only 130kB/day (using more is very expensive).

I want to send as much "useful" data as possible every day, while staying below 130kB.

I read here (How are filenames stored?) and here (Doesn't metadata occupy any size?) that metadata is stored in a dedicated part of the file system, but it is not clear to me how many bytes it will "cost" to send it.

If I use FTP for example, does it depend on the source filesystem? On the server filesystem? Or is it related to the FTP protocol?

Speaking about transfer protocols, what is the most cost-effective one? I googled a bit and it seems that each protocol consumes bits and bytes for handshaking, data integrity checks, etc., but I did not clearly find which one is the most economical, or how many bytes are required for the management of the protocol itself.

I also read about block size. Is this relevant for data transfer, or only for data storage (in the latter case it is not a problem)?

[EDIT 2023-11-08 11:00]

I am already working on data selection, data compression, error handling, etc. I am more familiar with those subjects, so I did not mention them in this question; I don't need help with them for the moment, and if that changes I will ask a separate question.

I have 130kB/day; let's say that 30kB is used by the protocol itself. My question is not how to format my data so I can send as many values as possible within 100kB. My question is: is it really 30kB? More? Less? Of course it depends. But it depends on what? In my original question I listed some ideas I had; I need your experience to know whether I missed something and/or to help me narrow my research toward lightweight solutions.

Elements of context:

It is for autonomous instruments deployed in Antarctica. No Lora-related solution is possible there.

Data to be sent is status and measurement data from the instruments. Data is stored locally and retrieved "physically" once a year. The data is used to see if some instrument parameters should be modified, to do some pre-analysis, and to prepare yearly maintenance.

If one day of data is missed or incomplete, it is not too problematic; it does not need to be re-sent the next day.

Blacksad
  • 291

7 Answers

19

The way to get the most data possible out of your 130kB/day is to eliminate as many layers of protocol as possible. FTP provides features like filenames, permissions, directory structure, and authentication. You probably don't need those features. The question becomes: exactly how much can we trim before we start to have problems?

A good starting point would be replacing FTP with HTTP. There is some overhead, but it's pretty minimal. As an example, I just tried an HTTP request with curl and HTTP added a 771 byte overhead in the form of headers. You can optimize this further if you want. Note that in addition to that 771 byte overhead from HTTP, there is some overhead from TCP, since HTTP runs on top of TCP.

A better option would be to just send the file over TCP directly. TCP has some overhead. This source puts it at about 2.74% (including IPv4 headers). If you send the file over TCP directly, no metadata about the file will be transmitted. You won't know the original filename. That is probably fine though. You can just name it based on what time it was received.
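As a sanity check on that 2.74% figure, the overhead can be estimated from the header sizes alone. A rough sketch, assuming a standard 1500-byte MTU, no TCP options, and ignoring retransmissions and ACK traffic:

```python
# Back-of-envelope TCP/IPv4 overhead, assuming full-size segments,
# no TCP options, no retransmissions, and ignoring ACK traffic.
MTU = 1500
IP_HDR, TCP_HDR = 20, 20
MSS = MTU - IP_HDR - TCP_HDR          # 1460 payload bytes per packet

overhead = (IP_HDR + TCP_HDR) / MSS   # header bytes per payload byte
print(f"overhead: {overhead:.2%}")    # ~2.74%

budget = 130 * 1000                   # 130 kB/day link budget
usable = budget / (1 + overhead)
print(f"usable payload per day: {usable:.0f} bytes")
```

Real links add retransmissions and ACKs on top of this, so treat it as a lower bound.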

If you want to save a little more, you can use UDP. This would take work, but it could help you get that 2.74% number a little lower.

If you want to save even more, you can use raw IP sockets. They are functionally identical to UDP, except without port numbers or checksums, and with 8 fewer bytes of overhead per packet. The lack of port numbers, however, means they can't traverse NAT. This would require your remote measurement computer to have a public IP address of its own, which it probably does not. You might be able to make it work, but I don't think it would be worth the effort versus UDP.

As others have pointed out, your biggest savings potential comes from switching from sending files to sending data. From a comment, you said:

My instrument's original daily file is 80MB large (1 line / sec, 95 columns of floats)

Assuming you mean 32-bit floats, that file should be

[4 bytes per float]*[95 columns]*[86400 seconds in a day] = 31.31MiB

Maybe you are storing 64-bit floats (do you really need that level of precision)?

[8 bytes per float]*[95 columns]*[86400 seconds in a day] = 62.62MiB

From the way you have been talking I am guessing that these values don't normally change quickly. Maybe you could send high precision values for the first row, then send smaller deltas for each subsequent row. If you are willing to post one of your data files, I would be interested to take a crack at seeing how small I could get it.
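To illustrate the delta idea, here is a minimal sketch. The 1000x scale factor and the 16-bit delta type are assumptions; both must be tuned to your sensors' actual range and rate of change:

```python
import struct

def encode_column(values, scale=1000):
    """First value as float64, then 16-bit scaled deltas.

    Assumes successive differences fit in a signed 16-bit int
    after multiplying by `scale` -- tune both to your data.
    """
    out = struct.pack("<d", values[0])
    prev = round(values[0] * scale)
    for v in values[1:]:
        q = round(v * scale)
        out += struct.pack("<h", q - prev)
        prev = q
    return out

def decode_column(blob, scale=1000):
    (first,) = struct.unpack_from("<d", blob, 0)
    vals = [first]
    acc = round(first * scale)
    for (d,) in struct.iter_unpack("<h", blob[8:]):
        acc += d
        vals.append(acc / scale)
    return vals

readings = [20.001, 20.003, 20.002, 19.998, 20.000]
blob = encode_column(readings)
print(len(blob))   # 8 + 2*(n-1) bytes instead of 8*n raw
```

For 86400 rows this is roughly a 4x saving per column before any general-purpose compression is even applied.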

9072997
  • 641
10

Can't comment yet, so here are my pointers:

130kB/day is probably too limited for a lot of file-based transfers, but can be used rather efficiently in other ways if you constrain yourself a bit more. Research on middleware and low-level protocols is probably more relevant to this case than generic file transfer. Another domain with this kind of problem is remote IoT devices; LoRa (or LoRaWAN) could be of interest to you.

Another angle to tackle this problem would be to lean on shared knowledge. Things like differential transfers (skipping default entries) and lookup tables for possible messages would reduce the actual bandwidth to a minimum, but will require a good understanding and encoding of your communication. Protocol Buffers are one solution for this kind of problem.
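The core trick Protocol Buffers uses for small integers is the base-128 varint, which spends one byte per 7 bits of value so that small numbers stay small on the wire. A minimal sketch:

```python
def encode_varint(n):
    """Protocol-Buffers-style base-128 varint: 7 bits per byte,
    high bit set on every byte except the last."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def decode_varint(data):
    """Return (value, number_of_bytes_consumed)."""
    n = shift = 0
    for i, b in enumerate(data):
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            return n, i + 1
        shift += 7

print(encode_varint(1))      # one byte for small values
print(encode_varint(300))    # two bytes
```

Combined with delta encoding, most readings end up as one- or two-byte varints.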

Don't forget to account for error correction. It will increase the raw size, but it prevents the huge latency of a resend over less reliable transports.

9

I read here (How are filenames stored?) and here (Doesn't metadata occupy any size?) that metadata is stored in a dedicated part of the file system, but it is not clear to me how many bytes it will "cost" to send it

File metadata is not some opaque thing that you just get from a filesystem and "send". It's a set of individual parameters that you pick and choose – different protocols send different sets of metadata, and if you were to create your own software for transferring files, you get to decide which fields you want to send and you decide when and how to send them.

The largest kind of metadata – the layout that describes how the file is physically stored on disk – is not only filesystem-dependent but is also completely internal. That is, while you can ask the filesystem about the list of file extents, that's not the kind of metadata you ever need to transfer (or even look at); you just read the file from start to end, and the receiver's filesystem decides on its own storage layout.

Most other fields are either small (such as timestamps) or optional and not necessary to transfer (such as file permissions, which e.g. SMB or NFS will transfer but HTTP will not – and you certainly do not need to, either).

Finally, since this is multiple fields and not just a single opaque chunk of data, the total size also highly depends on how you choose to arrange those fields. For example, do you send the modification time as a textual date or as a 64-bit nanosecond field or as decimal seconds or as a varint or do you just not send it at all?
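To make that concrete, here is a quick size comparison of the same modification time in three of those encodings (the timestamp value is just an example):

```python
import struct, time

t = 1699441200  # example Unix timestamp: 2023-11-08 11:00:00 UTC

as_text  = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(t)).encode()
as_u64ns = struct.pack("<Q", t * 10**9)   # 64-bit nanosecond field
as_u32s  = struct.pack("<I", t)           # 32-bit decimal seconds

print(len(as_text), len(as_u64ns), len(as_u32s))   # 20 8 4
```

The same field costs 20 bytes as ISO text, 8 as nanoseconds, 4 as seconds, and 0 if you decide not to send it at all.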

That is to say, it's difficult if not impossible to give you a ready-to-use "which protocol is best" answer at this stage. You need to spend a few moments studying network protocol designs; at minimum, you should look at how some of those protocols work – their specifications or packet captures – to get a rough idea of how "metadata" works.

If I use FTP for example, does it depend on the source filesystem? On the server filesystem? Or is it related to the FTP protocol?

The source or destination filesystem generally do not matter when using network file transfer protocols, as the entire purpose of such protocols is to abstract away the specifics of the underlying file storage and to define exactly what is sent over the network.

When a client is talking to an FTP server, it knows nothing about the underlying filesystem (and it might not even be a real filesystem; the FTP server could just as well present a MySQL table view as "files"...), all it exchanges is FTP commands – and it only transfers the metadata fields that are defined in FTP.

I read also about block size. Is this problem relevant for data transfer or is it just for data storage (in the later case it is not a problem)?

Both. For example, some transfer protocols apply a checksum to each block (see e.g. XMODEM for a commonly used example); this will slightly increase the total amount of data if everything goes well, but at the same time will massively reduce the amount of data if the link quality is poor and some blocks need to be retransmitted (which will be cheaper than resending the whole file). It's a tradeoff that you adjust depending on your specific needs.

(In this case you can usually treat 'block' and 'packet' as roughly the same thing. The block size is defined by the transfer protocol used and has nothing to do with the storage.)
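A toy model of that tradeoff, with purely illustrative numbers (128-byte blocks, 4 bytes of checksum/framing per block, 1% chance of any block needing a resend):

```python
# Toy model: with per-block checksums a corrupted block costs one
# block resent; without them, any corruption costs the whole file.
def expected_bytes(file_size, block, overhead, p_block_loss):
    """Expected bytes on the wire with per-block retransmission.

    Each block of `block` payload bytes carries `overhead` extra
    bytes; a block is resent with probability p_block_loss per
    attempt (geometric retries)."""
    n_blocks = -(-file_size // block)      # ceiling division
    attempts = 1 / (1 - p_block_loss)      # expected sends per block
    return n_blocks * (block + overhead) * attempts

per_block = expected_bytes(100_000, 128, 4, 0.01)
print(f"per-block checksums: {per_block:.0f} bytes")

# vs. resending the entire 100 kB file whenever any block is corrupted:
p_file_loss = 1 - (1 - 0.01) ** (100_000 // 128 + 1)
whole_file = 100_000 / (1 - p_file_loss)
print(f"whole-file retries:  {whole_file:.0f} bytes")
```

With a 1% per-block error rate, the checksum overhead is a few percent, while whole-file retries blow the budget entirely; on a clean link the numbers reverse.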

grawity
  • 501,077
6

Knowing a bit about instrumentation but nothing about your instrument, there's almost always stuff you can do before compression.

For example:

  • You can use a smaller but less precise datatype, perhaps using a datum value from which you send the difference, or just send the difference from the previous reading; if this is known to be small, you may be able to use a smaller (even custom) datatype.
  • You can average datapoints in a column rather than just picking every nth point.
    • This means the SNR of the data you have to work with is better,
    • It also potentially improves compressibility as noise doesn't compress well and you're reducing the noise.
  • You can obviously select the most meaningful columns, and that heavily depends on what you want to know. It may not be a trivial decision. Say you're reading a bunch of temperatures - you might want daily or hourly averages for every sensor every day, but each day pick a different sensor to send at 1s resolution to look for short-term rises in temperature.
  • You may even find that transposing the data allows it to compress better. That's certainly the case for crude compression methods on ASCII data as runs of identical or even similar values compress more than cycling through widely differing data. So instead of sending records of (for example) timestamp, latitude, longitude, altitude, temperature you send a list of timestamps, a list of latitudes, etc. You would need to test this. Depending on your error correction you should even test on a noisy channel; if you merely detect and skip errors, would you rather lose a column or a row (in source data coordinates)?
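The transposition point is easy to test with a few lines. The synthetic data below is made up (slowly drifting sensor columns), so only a run against one of your real files will tell you which layout actually wins:

```python
import random, zlib

random.seed(0)
# Synthetic data: a timestamp plus two slowly-drifting sensor columns.
rows, t, a, b = [], 0, 20.0, 101.3
for _ in range(1000):
    t += 1
    a += random.uniform(-0.01, 0.01)
    b += random.uniform(-0.01, 0.01)
    rows.append((t, round(a, 2), round(b, 2)))

# Row-major: one record per line. Column-major: one column per line.
row_major = "\n".join(",".join(map(str, r)) for r in rows).encode()
col_major = "\n".join(",".join(map(str, c)) for c in zip(*rows)).encode()

print(len(zlib.compress(row_major, 9)), len(zlib.compress(col_major, 9)))
```

Whichever layout wins here, also rerun the test with your real compressor and your real error-handling scheme, since both change the outcome.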
Chris H
  • 2,245
5

For metadata, there are two types: file-system and built-in.

The file-system metadata includes creation date, owning user, and more, but it usually remains in the source file-system and is created anew on the target file-system. In short, it's usually not transferred; only the file data is transferred. However, if you transfer an archive, such as Zip, the metadata is included in the archive.

The built-in metadata is included in some files, such as Office files, and can include details about the author, the date, and more. This data is transmitted with the document itself and is indivisible.

If you wish to use the bandwidth to the maximum, the protocol itself is less important; it can be FTP, FTPS, SFTP or another as needed. It is much more important to reduce the amount of data to be transferred.

You may do this by the obvious method of limiting the data to be transferred, but can also use compression methods to reduce the size of the data. Zip is the older compression method, but 7Zip is newer and more efficient in most cases.
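A quick way to compare the two families is Python's zlib (Deflate, the algorithm used by Zip) against lzma (the algorithm used by 7-Zip). The sample data below is synthetic CSV-like text, so run it against your own files for a meaningful number:

```python
import lzma, zlib

# Repetitive, CSV-like sample data; substitute one of your real files.
data = b"".join(b"%d,20.0%d,1013.2%d\n" % (i, i % 7, i % 3)
                for i in range(5000))

deflated = zlib.compress(data, 9)         # Deflate, as used by Zip
lzma_out = lzma.compress(data, preset=9)  # LZMA, as used by 7-Zip
print(len(data), len(deflated), len(lzma_out))
```

Which one wins, and by how much, depends on the data, which is exactly why a parameter search like the finetuner below is worthwhile.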

See the post What are the best options to use when compressing files using 7 Zip?

My answer in this post shows that the best compression parameters vary according to the type of data involved. To find the most efficient parameters, I have used the 7-Zip finetuner. This tool hunts for the optimal parameters by simply repeating the compression with varying settings until it finds the best combination. You may use it on your data to find the best parameters.

Note that data that is already compressed can hardly be compressed further. There is little point in compressing files such as Zip archives or Office documents.

harrymc
  • 498,455
5

Which protocol to send the most possible data through satellite?

Since you present your question as an absolute ("most"):

A custom protocol that transmits compressed raw binary data in its smallest (least precise) acceptable form and omits any overheads like file metadata. This might mean serialising data into variable length data types. You'd probably have to experiment with many different approaches using a large body of representative data.

You could maybe use UDP and roll your own checking algorithm for missing, duplicate or out of sequence packets but it might be best to start with TCP.

I would include a checksum.

Obviously, for a custom protocol you have to write both client and server software and make your own assessment of the implications for security etc.
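As a sketch of what such a custom frame might look like (the magic byte, field sizes, and CRC choice are all illustrative assumptions, not a standard):

```python
import struct, zlib

MAGIC = 0xA7  # arbitrary frame marker (illustrative)

def frame(seq, payload):
    """Frame: 1-byte magic, 2-byte sequence number, 2-byte length,
    payload, 4-byte CRC32 -- 9 bytes of overhead per frame."""
    hdr = struct.pack("<BHH", MAGIC, seq, len(payload))
    crc = zlib.crc32(hdr + payload)
    return hdr + payload + struct.pack("<I", crc)

def unframe(data):
    magic, seq, length = struct.unpack_from("<BHH", data)
    payload = data[5:5 + length]
    (crc,) = struct.unpack_from("<I", data, 5 + length)
    if magic != MAGIC or zlib.crc32(data[:5 + length]) != crc:
        raise ValueError("corrupt frame")
    return seq, payload

pkt = frame(7, b"\x01\x02\x03")
print(len(pkt) - 3)   # 9 bytes of overhead
```

The sequence number lets the receiver detect missing or duplicate frames over UDP; over TCP you could drop it and the length field shrinks the overhead further.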

0

Have you considered using MQTT Protocol? It is perfect for M2M communications because it is designed as an extremely lightweight publish/subscribe messaging transport that is ideal for connecting remote devices with a small code footprint and minimal network bandwidth.
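For a sense of how light MQTT is on the wire, a QoS-0 PUBLISH packet can be built by hand from the MQTT 3.1.1 wire format (in practice you would use a client library; this sketch just counts the bytes, and ignores the underlying TCP and one-time CONNECT overhead):

```python
def mqtt_publish(topic, payload):
    """Minimal MQTT 3.1.1 QoS-0 PUBLISH packet (no packet ID at QoS 0)."""
    t = topic.encode()
    var = len(t).to_bytes(2, "big") + t + payload
    # The "remaining length" field is a base-128 varint.
    rem, n = bytearray(), len(var)
    while True:
        b = n % 128
        n //= 128
        rem.append(b | 0x80 if n else b)
        if not n:
            break
    return bytes([0x30]) + bytes(rem) + var  # 0x30 = PUBLISH, QoS 0

pkt = mqtt_publish("s/1", b"x" * 100)
print(len(pkt) - 100)   # 7 bytes of MQTT overhead for a 100-byte payload
```

Keeping topic names short (as in the hypothetical "s/1" above) matters, since the full topic string is sent in every message.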

Here are some open-source brokers and messaging systems:

Eclipse Mosquitto

RabbitMQ

Apache ActiveMQ

Apache Kafka

ZeroMQ

NSQ