Created Jun 21, 2021 by Michael Forbes (@mforbes, Maintainer)

Multifile format.

I propose that we support a multi-file format whereby all files with the names <prefix>_<varname>.*.<ext> represent a contiguous set of additional data to be concatenated along the first axis to the data in <prefix>_<varname>.<ext>. Formally the specification would be:

  1. Gather all files <prefix>_<varname>.<ext> and <prefix>_<varname>.*.<ext>.
  2. Sort them using a "natural" sorting algorithm (more on this below).
  3. Concatenate the data along the first (time) axis.
  4. If the time abscissae are specified in <prefix>__t.<ext>, then the corresponding files <prefix>__t.*.<ext> are required (it is an error if they are missing).
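
Steps 1 and 4 above could be sketched roughly as follows. This is only an illustration operating on a list of filenames: the helper names `gather` and `missing_time_files` are hypothetical, and reading "corresponding" as "same middle portion of the name" is an assumption about the intent of step 4.

```python
from fnmatch import fnmatchcase


def gather(names, prefix, varname, ext):
    """Step 1 (hypothetical helper): return (principal, extras) for one variable."""
    principal = f"{prefix}_{varname}.{ext}"
    pattern = f"{prefix}_{varname}.*.{ext}"
    if principal not in names:
        raise FileNotFoundError(principal)
    # The pattern needs two dots, so it never matches the principal file itself.
    extras = [name for name in names if fnmatchcase(name, pattern)]
    return principal, extras


def missing_time_files(names, prefix, varname, ext):
    """Step 4 (assumed reading): list the <prefix>__t.*.<ext> files that are absent."""
    _, extras = gather(names, prefix, varname, ext)
    if f"{prefix}__t.{ext}" not in names:
        return []  # No time abscissae were specified, so nothing is required.
    head, tail = f"{prefix}_{varname}.", f".{ext}"
    missing = []
    for name in extras:
        middle = name[len(head):-len(tail)]  # The part matched by "*".
        t_name = f"{prefix}__t.{middle}.{ext}"
        if t_name not in names:
            missing.append(t_name)
    return missing


names = ["pre_x.npy", "pre_x.0.npy", "pre_x.1.npy", "pre__t.npy", "pre__t.0.npy"]
print(gather(names, "pre", "x", "npy"))
print(missing_time_files(names, "pre", "x", "npy"))
```

Here the second call reports `pre__t.1.npy` as missing, which under this reading would be the error case in step 4. Sorting (step 2) and concatenation (step 3) are left out of this sketch.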

The sort order should be something akin to what is sometimes called a natural sort order.
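
To illustrate why a natural order is wanted, here is a minimal comparison of a plain lexicographic sort with a natural sort. The key function `natural_key` is only a small sketch of the general technique, not the proposed implementation.

```python
import re


def natural_key(name):
    """Split into text and digit runs; compare the digit runs as integers."""
    return [
        int(part) if part.isdigit() else part
        for part in re.split(r"([0-9]+)", name)
    ]


names = ["pre.9.wdat", "pre.10.wdat", "pre.2.wdat"]
lexicographic = sorted(names)
natural = sorted(names, key=natural_key)
print(lexicographic)  # pre.10 comes before pre.2 and pre.9
print(natural)        # pre.2, pre.9, pre.10
```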

The following might provide a simple reference implementation, modified from that of Ned Batchelder:

"""Utilities."""
from decimal import Decimal
import os.path
import re


filename_regexp = re.compile(
    r"""
    (?P<prefix>[^\.]+)\.             # Prefix
    (
        (?P<decimal>-?\d*(\.\d+)?)?  # Optional Decimal
        (?P<remainder>.*)\.          # Remainder
    )?                               # This portion is optional
    (?P<ext>\w*)                     # Extension
""",
    re.VERBOSE,
)


def filename_sort_key(filename):
    """Return the sort key for filename.

    The natural-sort pieces of the remainder are kept in a nested list so that
    keys whose remainders have different lengths never compare an int with a str.

    Examples
    --------
    >>> filename_sort_key("pre.3.npy")
    ['pre', Decimal('3'), [''], 'npy']
    >>> filename_sort_key("pre.3a.npy")
    ['pre', Decimal('3'), ['a'], 'npy']
    >>> filename_sort_key("pre.3.a.npy")
    ['pre', Decimal('3'), ['.a'], 'npy']
    """
    match = filename_regexp.match(os.path.basename(filename))
    if not match:
        raise ValueError(
            f"Invalid filename {filename}. "
            + "Must have form <prefix>.[<decimal>]<str>.<ext>."
        )

    prefix, decimal, remainder, ext = map(
        match.group, ["prefix", "decimal", "remainder", "ext"]
    )
    if not decimal:
        decimal = "-inf"  # Make sure that empty decimals sort first
    decimal = Decimal(decimal)

    # This is the Human sorting algorithm from
    # https://nedbatchelder.com/blog/200712/human_sorting.html
    if remainder:
        remainder = [
            int(c) if c.isdigit() else c for c in re.split("([0-9]+)", remainder)
        ]
    else:
        remainder = [""]

    # Nesting the remainder keeps the extension from being compared with an
    # integer piece, e.g. for "pre.a.wdat" versus "pre.a1.wdat".
    return [prefix, decimal, remainder, ext]


def sort_filenames(filenames):
    """Sort filenames in a natural order.

    We assume that the filenames have the following form::

        <prefix>.<ext>                   # Always comes first
        <prefix>.[<decimal>]<str>.<ext>

    Filenames without a decimal will sort before those with a decimal.

    This splits the strings into a sequence of digits and letters, then interprets the
    digits as integers for the purpose of comparison.

    Examples
    --------
    >>> files = ["pre.wdat", "pre.a.ext", "pre.-1.3.wdat", "pre.-1.2.wdat",
    ...          "pre.0.wdat", "pre.1.wdat", "pre.002.wdat",
    ...          "pre.3.wdat", "pre.3.0014.wdat", "pre.3.013.wdat", "pre.3.0131.wdat",
    ...          "pre.4.wdat", "pre.9.wdat", "pre.10.wdat", "pre.11.wdat",
    ...          "pre.99.wdat", "pre.100.wdat", "pre.3.a.npy"]
    >>> print("\\n".join(sort_filenames(files)))
    pre.wdat
    pre.a.ext
    pre.-1.3.wdat
    pre.-1.2.wdat
    pre.0.wdat
    pre.1.wdat
    pre.002.wdat
    pre.3.wdat
    pre.3.a.npy
    pre.3.0014.wdat
    pre.3.013.wdat
    pre.3.0131.wdat
    pre.4.wdat
    pre.9.wdat
    pre.10.wdat
    pre.11.wdat
    pre.99.wdat
    pre.100.wdat

    References
    ----------
    * https://blog.codinghorror.com/sorting-for-humans-natural-sort-order/
    * https://nedbatchelder.com/blog/200712/human_sorting.html
    """
    dirs = set(map(os.path.dirname, filenames))
    if len(dirs) != 1:
        raise ValueError(
            f"Files must all be in the same directory: got {len(dirs)}: {dirs}"
        )

    return sorted(filenames, key=filename_sort_key)

These sorting criteria mean that the principal file <prefix>_<varname>.<ext> always comes first, and they also allow for numerical sorting (which is backwards compatible with our prior format).

In practice, we should have a simple format for adding new data, probably just using numbers:

pre.wdat
pre.0.wdat
pre.1.wdat
pre.2.wdat
...
pre.9.wdat
pre.10.wdat
...

or possibly using zero padding:

pre.wdat
pre.0000.wdat
pre.0001.wdat
pre.0002.wdat
...
pre.0009.wdat
pre.0010.wdat
...
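
As a rough illustration of the two naming schemes, the sketch below generates both and checks how they behave under a plain lexicographic sort. The helper `chunk_name` and its `width` parameter are hypothetical and not part of the proposal.

```python
def chunk_name(prefix, index, ext, width=0):
    """Hypothetical helper: build '<prefix>.<index>.<ext>', zero-padded to `width` digits."""
    return f"{prefix}.{str(index).zfill(width)}.{ext}"


plain = [chunk_name("pre", i, "wdat") for i in range(12)]
padded = [chunk_name("pre", i, "wdat", width=4) for i in range(12)]

# With zero padding, a plain lexicographic sort already matches numeric order.
padded_ok = sorted(padded) == padded
# Without padding it does not: "pre.10.wdat" sorts before "pre.2.wdat".
plain_ok = sorted(plain) == plain
# Either way, a plain sort puts the principal file last ("0" < "w" in ASCII),
# so a dedicated sort key is still needed to make it come first.
principal_last = sorted(["pre.wdat", "pre.0000.wdat"])[-1] == "pre.wdat"

print(padded_ok, plain_ok, principal_last)
```

Zero padding therefore mainly helps tools that sort lexicographically, such as directory listings, at the cost of fixing a maximum number of chunks in advance.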
Edited Jun 21, 2021 by Michael Forbes