Multifile format.
I propose that we support a multi-file format whereby all files with the names <prefix>_<varname>.*.<ext>
represent a contiguous set of additional data to be concatenated along the first axis to the data in <prefix>_<varname>.<ext>
. Formally the specification would be:
- Gather all files
<prefix>_<varname>.<ext>
and<prefix>_<varname>.*.<ext>
. - Sort them using a "natural" sorting algorithm (more on this below).
- Concatenate the data along the first (time) axis.
- If the time abscissa are specified
<prefix>__t.<ext>
then corresponding files<prefix>__t.*.<ext>
would be required (otherwise it is an error).
The sort order should be something akin to what is sometimes called a natural sort order.
The following from might provide a simple reference implementation modified from that of Ned Batchelder:
"""Utilities."""
from decimal import Decimal
import os.path
import re
filename_regexp = re.compile(
r"""
(?P<prefix>[^\.]+)\. # Prefix
(
(?P<decimal>-?\d*(\.\d+)?)? # Optional Decimal
(?P<remainder>.*)\. # Remainder
)? # This portion is optional
(?P<ext>\w*) # Extension
""",
re.VERBOSE,
)
def filename_sort_key(filename):
"""Return the sort key for filename.
Examples
--------
>>> filename_sort_key("pre.3.npy")
['pre', Decimal('3'), '', 'npy']
>>> filename_sort_key("pre.3a.npy")
['pre', Decimal('3'), 'a', 'npy']
>>> filename_sort_key("pre.3.a.npy")
['pre', Decimal('3'), '.a', 'npy']
"""
match = filename_regexp.match(os.path.basename(filename))
if not match:
raise ValueError(
f"Invalid filename {filename}. "
+ "Must have form <prefix>.[<decimal>]<str>.<ext>."
)
prefix, decimal, remainder, ext = map(
match.group, ["prefix", "decimal", "remainder", "ext"]
)
if not decimal:
decimal = "-inf" # Make sure that empty decimals sort first
decimal = Decimal(decimal)
# This is the Human sorting algorithm from
# https://nedbatchelder.com/blog/200712/human_sorting.html
if remainder:
remainder = [
int(c) if c.isdigit() else c for c in re.split("([0-9]+)", remainder)
]
else:
remainder = [""]
return [prefix, decimal] + remainder + [ext]
def sort_filenames(filenames):
"""Sort filenames in a natural order.
We assume that the filenames have the following form::
<prefix>.<ext> # Always comes first
<prefix>.[<decimal>]<str>.<ext>
Filenames without a decimal will sort before those with a decimal.
This splits the strings into a sequence of digits and letters, then interprets the
digits as integers for the purpose of comparison.
Examples
--------
>>> files = ["pre.wdat", "pre.a.ext", "pre.-1.3.wdat", "pre.-1.2.wdat",
... "pre.0.wdat", "pre.1.wdat", "pre.002.wdat",
... "pre.3.wdat", "pre.3.0014.wdat", "pre.3.013.wdat", "pre.3.0131.wdat",
... "pre.4.wdat", "pre.9.wdat", "pre.10.wdat", "pre.11.wdat",
... "pre.99.wdat", "pre.100.wdat", "pre.3.a.npy"]
>>> print("\\n".join(sort_filenames(files)))
pre.wdat
pre.a.ext
pre.-1.3.wdat
pre.-1.2.wdat
pre.0.wdat
pre.1.wdat
pre.002.wdat
pre.3.wdat
pre.3.a.npy
pre.3.0014.wdat
pre.3.013.wdat
pre.3.0131.wdat
pre.4.wdat
pre.9.wdat
pre.10.wdat
pre.11.wdat
pre.99.wdat
pre.100.wdat
References
----------
* https://blog.codinghorror.com/sorting-for-humans-natural-sort-order/
* https://nedbatchelder.com/blog/200712/human_sorting.html
"""
dirs = set(map(os.path.dirname, filenames))
if not len(dirs) == 1:
raise ValueError(
f"Files must all be in the same directory: got {len(dirs)}: {dirs}"
)
return sorted(filenames, key=filename_sort_key)
The sorting criteria means that the principal file <prefix>_<varname>.<ext>
always comes first, and also allows for numerical sorting (which is backwards compatible with our prior format).
In practise, we should have a simple format for adding new data, probably just using numbers
pre.wdat
pre.0.wdat
pre.1.wdat
pre.2.wdat
...
pre.9.wdat
pre.10.wdat
...
or possibly using zero padding:
pre.wdat
pre.0000.wdat
pre.0001.wdat
pre.0002.wdat
...
pre.0009.wdat
pre.0010.wdat
...