Converting Open data without a ready-made adapter¶

Sometimes you spot an interesting public dataset (a Zenodo upload, a Kaggle competition, a Figshare release) that isn't yet represented under datasets/ in the monorepo. This page walks through the manual conversion to a valid Telemachus Telemachus dataset.

0. Before you start — check the license¶

Run the license test from Writing an adapter → Licensing pitfall. A CC-BY-NC-ND source forbids republishing derivatives, so you can ship the adapter code but not the converted parquet. Everything that follows assumes you're allowed to keep a local copy for your analysis.

1. Inventory the source¶

Open the raw files and identify, column by column, what each sensor stream actually contains:

Source column	Unit	Rate	Maps to Telemachus format
`timestamp_ms`	ms since epoch	—	`ts` (convert to UTC datetime)
`latitude`	deg	1 Hz	`lat`
`longitude`	deg	1 Hz	`lon`
`speed_kmh`	km/h	1 Hz	`speed_mps` (÷ 3.6)
`accel_x_g`	g	100 Hz	`ax_mps2` (× 9.80665)
`accel_y_g`	g	100 Hz	`ay_mps2`
`accel_z_g`	g	100 Hz	`az_mps2`
`gyro_x_dps`	deg/s	100 Hz	`gx_rad_s` (× π/180)
…

If any expected Telemachus column is missing, plan how you'll handle it:

GPS columns absent at IMU rate → leave as NaN (multi-rate convention, SPEC-01 §3.5)
Heading missing → recompute from consecutive GPS points (Haversine bearing)
Gyro missing → simply leave the gyro columns absent (SPEC-01 §3.3: must be absent OR all-NaN, never zero-filled)

2. Fetch & unpack¶

Scripts go in an adapter folder under your working copy, not committed if the license is restrictive:

mkdir -p datasets/xx_my_source/
cd datasets/xx_my_source/
cat > download.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
mkdir -p raw
curl -sSL -o raw/data.zip "https://example.org/dataset.zip"
cd raw && unzip -q data.zip && cd ..
echo "raw files under datasets/xx_my_source/raw/"
EOF
chmod +x download.sh

3. Write the adapter¶

See the full template in Writing an adapter. The minimum your adapter must do:

Read the raw files (CSV / Parquet / whatever).
Rename & convert columns (units!) to the Telemachus names.
Sort by ts, ensure monotonicity, drop or fix duplicates.
Write the Telemachus parquet.
Emit a manifest.yaml declaring hardware, sensors.*.rate_hz, acc_periods (start/end/frame), source block, license.

4. Detect the accelerometer frame¶

Critical step often forgotten. Run this check on a known stationary segment (device on a table, vehicle parked + engine off):

import numpy as np, pandas as pd
df = pd.read_parquet("data.parquet")

# Pick a stationary window (first 10 seconds for instance)
rest = df.iloc[:int(10 * 100)]  # 10 s at 100 Hz
a_norm = np.sqrt(rest["ax_mps2"]**2 + rest["ay_mps2"]**2 + rest["az_mps2"]**2)
mean_g = a_norm.mean()
print(f"||a|| at rest: {mean_g:.2f} m/s²")

if mean_g > 8:
    frame = "raw"           # gravity present
elif mean_g < 2:
    frame = "compensated"   # firmware-stripped
else:
    frame = "partial"
print(f"→ frame = {frame}")

Put the result in manifest.yaml under acc_periods:

acc_periods:
  - start: 2024-01-01T00:00:00Z
    end:   2024-12-31T23:59:59Z
    frame: raw               # or compensated / partial
    detection_method: auto
    residual_g: 0.0          # only if frame=partial

5. Validate¶

Both artefacts must pass:

# Validate the dataset (manifest + parquet)
tele validate datasets/xx_my_source/ --level full

6. Submit (optional)¶

If the license permits redistribution, you can:

Open a PR adding your adapter under python-cli/adapters/.
Add the manifest.yaml under datasets/xx_my_source/.
If the raw volume is < 10 MB, commit the parquet too. Otherwise use git-lfs for files > 10 MB, or publish the parquet on Zenodo and reference it in the manifest.

See Open sources matrix for the current coverage to avoid overlap.