We are currently using Box as our cloud file system, but the workflow sometimes takes forever to run because of the number of files it reads, and the connector, combined with the large number of folders, often causes it to stop responding.
We also have a Windows server that the Hub can reach, but from what I read, a shared folder on that server wouldn’t be usable as storage for workflows on the Hub.
I’m looking for options and suggestions to make this process easier, such as a local file system inside the container (I don’t even know if that is possible) or some other approach.
Anyway, I would appreciate any input on the topic.
There is no “local filesystem inside the container” that you can rely on for large, shared datasets: the container disk disappears between runs, between pods, and when the executor scales.
Because of that, using Box with lots of small files is unfortunately a very tough combination for good performance. Every small file can mean an extra network call, and those calls add up very quickly, even if you give the job much more CPU.
What usually works much better:

Combine the many small files into fewer, bigger files:
- Parquet files (very efficient for data workflows)
- ORC files
- one big ZIP or TAR archive
- or load everything into a database table

Reading a few large files is much faster than reading thousands of tiny ones; a sketch of this consolidation step follows below.
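If the small files happen to be CSVs, for example, a one-off script can consolidate them into a single Parquet file before the workflow reads anything. This is only a minimal sketch, assuming all files share the same schema, that pandas and pyarrow are available, and using hypothetical paths you would replace with your own:

```python
# Sketch: consolidate many small CSV files into one Parquet file.
# Paths and the shared-schema assumption are hypothetical -- adjust to your data.
from pathlib import Path

import pandas as pd  # needs pyarrow (or fastparquet) installed for Parquet output

SOURCE_DIR = Path("/data/box_export")                       # hypothetical folder with the small files
TARGET_FILE = Path("/data/combined/measurements.parquet")   # hypothetical output file


def combine_csvs_to_parquet(source_dir: Path, target_file: Path) -> int:
    """Read every CSV under source_dir and write a single Parquet file.

    Returns the number of rows written. Assumes all CSVs share one schema.
    """
    frames = []
    for csv_path in sorted(source_dir.glob("*.csv")):
        df = pd.read_csv(csv_path)
        df["source_file"] = csv_path.name  # keep provenance of each small file
        frames.append(df)

    combined = pd.concat(frames, ignore_index=True)
    target_file.parent.mkdir(parents=True, exist_ok=True)
    # Snappy compression keeps the file compact while staying fast to read.
    combined.to_parquet(target_file, compression="snappy", index=False)
    return len(combined)


if __name__ == "__main__":
    rows = combine_csvs_to_parquet(SOURCE_DIR, TARGET_FILE)
    print(f"Wrote {rows} rows to {TARGET_FILE}")
```

The workflow then only has to open one Parquet file instead of making thousands of small requests against Box.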
Move the data from Box to cloud object storage:
- Amazon S3 (or any S3-compatible storage)
- Azure Blob Storage
This would be much more performant and stable than Box for this kind of workload.
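As one illustration of the object-storage route, the boto3 SDK can push the consolidated file to an S3 bucket (or any S3-compatible endpoint). This is a sketch, not a definitive setup: the bucket name, object key, local path, and endpoint below are hypothetical placeholders, and credentials are assumed to come from the environment or an IAM role.

```python
# Sketch: upload the consolidated file to S3-compatible object storage with boto3.
# Bucket, key, local path, and endpoint are hypothetical placeholders.
import boto3
from botocore.config import Config

BUCKET = "my-workflow-data"                      # hypothetical bucket
KEY = "combined/measurements.parquet"            # hypothetical object key
LOCAL_FILE = "/data/combined/measurements.parquet"

# For plain Amazon S3 you can omit endpoint_url; for S3-compatible storage
# (MinIO, Ceph, etc.) point it at your own endpoint instead.
s3 = boto3.client(
    "s3",
    config=Config(retries={"max_attempts": 5, "mode": "standard"}),
    # endpoint_url="https://s3.example.internal",  # only for S3-compatible storage
)

# upload_file handles multipart uploads for large files automatically.
s3.upload_file(LOCAL_FILE, BUCKET, KEY)
print(f"Uploaded {LOCAL_FILE} to s3://{BUCKET}/{KEY}")
```

Once the data sits in object storage as a handful of large files, the workflow reads them with far fewer round trips than it currently needs against Box.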