File
File is a special DataModel,
which is automatically generated when a DataChain is created from files,
such as in dc.read_storage:
import datachain as dc
chain = dc.read_storage("gs://datachain-demo/dogs-and-cats")
chain.print_schema()
Output:
file: File@v1
source: str
path: str
size: int
version: str
etag: str
is_latest: bool
last_modified: datetime
location: Union[dict, list[dict], NoneType]
File classes include various metadata fields describing the underlying file,
along with methods to read and manipulate file contents.
File
Bases: DataModel
DataModel for reading binary files.
Attributes:
- source (str): The source of the file (e.g., 's3://bucket-name/').
- path (str): The path to the file (e.g., 'path/to/file.txt').
- size (int): The size of the file in bytes. Defaults to 0.
- version (str): The version of the file. Defaults to an empty string.
- etag (str): The ETag of the file. Defaults to an empty string.
- is_latest (bool): Whether the file is the latest version. Defaults to True.
- last_modified (datetime): The last modified timestamp of the file. Defaults to the Unix epoch (1970-01-01T00:00:00).
- location (dict | list[dict]): The location of the file. Defaults to None.
Source code in datachain/lib/file.py
name
property
name: str
parent
property
parent: str
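As a rough illustration of how these two properties relate to path, the sketch below derives both with the standard library. It assumes that name is the final path component and parent is everything before it, using POSIX separators. The helper names file_name and file_parent are hypothetical and not part of datachain.

```python
import posixpath


def file_name(path: str) -> str:
    """Final component of a POSIX-style path (assumed meaning of `name`)."""
    return posixpath.basename(path)


def file_parent(path: str) -> str:
    """Everything before the final component (assumed meaning of `parent`)."""
    return posixpath.dirname(path)


print(file_name("path/to/file.txt"))    # file.txt
print(file_parent("path/to/file.txt"))  # path/to
```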
as_audio_file
as_audio_file() -> AudioFile
Convert the file to an AudioFile object.
as_image_file
as_image_file() -> ImageFile
Convert the file to an ImageFile object.
as_text_file
as_text_file() -> TextFile
Convert the file to a TextFile object.
as_video_file
as_video_file() -> VideoFile
Convert the file to a VideoFile object.
at
classmethod
Construct a File from a full URI in one call.
Parameters:
- uri (str | PathLike[str]): Full URI or path to the file (e.g. s3://bucket/path/to/file.png or /local/path).
- session (Session | None, default: None): Optional session instance. If None, the current session is used.
Returns:
- File (Self): A new File object pointing to the given URI.
ensure_cached
Download the file to the local cache.
get_local_path can be used to return the path to the cached copy on disk.
This is useful when you need to pass the file to code that expects a local
filesystem path (e.g. ffmpeg, opencv, pandas).
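The two calls are typically combined, which can be sketched as follows. The helper local_path_for is hypothetical; it only assumes that the object passed in exposes ensure_cached() and get_local_path() as described in this reference, and it does not import datachain itself.

```python
def local_path_for(file) -> str:
    """Return a local filesystem path for `file`, downloading it if needed.

    `file` is assumed to behave like a File: ensure_cached() downloads it
    and get_local_path() returns the cached path or None.
    """
    file.ensure_cached()           # download into the local cache
    path = file.get_local_path()   # None if the file is not cached
    if path is None:
        raise RuntimeError("file is not available in the local cache")
    return path
```

The returned string can then be handed to tools that only accept local paths.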
export
export(
output: str | PathLike[str],
placement: ExportPlacement = "fullpath",
use_cache: bool = True,
link_type: Literal["copy", "symlink"] = "copy",
client_config: dict | None = None,
) -> None
Copy or link this file into an output directory.
Parameters:
- output (str | PathLike[str]): Destination directory. Accepts a local OS path or a cloud prefix given as an fsspec URI (s3://…, gs://…, az://…).
- placement (ExportPlacement, default: 'fullpath'): How to build the path under output:
  - "fullpath" (default): output/bucket/dir/file.txt
  - "filepath": output/dir/file.txt
  - "filename": output/file.txt
  - "etag": output/<etag>.txt
- use_cache (bool, default: True): If True, download to local cache first. Also required for symlinking remote files.
- link_type (Literal['copy', 'symlink'], default: 'copy'): "copy" (default) or "symlink". Symlink falls back to copy for virtual files and for remote files when use_cache is False.
- client_config (dict | None, default: None): Extra kwargs forwarded to the storage client.
Example
# flat export by filename
f.export("./export", placement="filename")
# export to a cloud prefix
f.export("s3://output-bucket/results", placement="filepath")
# pass storage credentials via client_config
f.export("s3://bucket/out", client_config={"aws_access_key_id": "…"})
# symlink from local cache (avoids re-downloading)
f.export("./local_out", use_cache=True, link_type="symlink")
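The four placement modes can be pictured with a small standalone sketch. This is not datachain's implementation: placement_path is a hypothetical helper that reproduces the documented layouts, and it assumes that "fullpath" drops the URI scheme so the bucket name becomes the first directory.

```python
import posixpath


def placement_path(source: str, path: str, etag: str,
                   placement: str = "fullpath") -> str:
    """Relative destination under the output directory (hypothetical helper)."""
    name = posixpath.basename(path)
    if placement == "filename":
        return name
    if placement == "filepath":
        return path
    if placement == "etag":
        # Keep the original extension, as in output/<etag>.txt
        return etag + posixpath.splitext(name)[1]
    if placement == "fullpath":
        # Assumption: the scheme is dropped so the bucket becomes a directory
        bucket = source.split("://", 1)[-1].strip("/")
        return posixpath.join(bucket, path) if bucket else path
    raise ValueError(f"unknown placement: {placement!r}")


for mode in ("fullpath", "filepath", "filename", "etag"):
    rel = placement_path("s3://bucket", "dir/file.txt", "abc123", mode)
    print(mode, "->", posixpath.join("output", rel))
```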
get_fs_path
get_fs_path() -> str
Combine source and path into the full location string.
For cloud backends the result is a full URI-like string (s3://…, gs://…).
For local files the result is a bare OS path (no file:// prefix).
Examples:
# Cloud (S3, GCS, Azure, …)
f = File(source="s3://my-bucket", path="data/image.jpg")
f.get_fs_path() # 's3://my-bucket/data/image.jpg'
# Local files
f = File(source="file:///home/user/project", path="out/result.csv")
f.get_fs_path() # '/home/user/project/out/result.csv'
# No source: returns the relative path as-is
f = File(source="", path="dir/file.txt")
f.get_fs_path() # 'dir/file.txt'
Raises:
- FileError: If path is empty, ends with /, or contains . or .. segments. For local files, also rejects absolute paths, drive letters, and empty segments.
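The joining rules and the basic validation can be approximated in a few lines. The function fs_path below is a hypothetical stand-in, not the library code: it raises ValueError instead of FileError and omits the additional local-file checks (absolute paths, drive letters, empty segments).

```python
def fs_path(source: str, path: str) -> str:
    """Join source and path following the documented examples (sketch only)."""
    if not path or path.endswith("/"):
        raise ValueError("path must be non-empty and must not end with '/'")
    if any(seg in (".", "..") for seg in path.split("/")):
        raise ValueError("path must not contain '.' or '..' segments")
    if not source:
        return path                        # no source: relative path as-is
    if source.startswith("file://"):
        source = source[len("file://"):]   # local files: bare OS path
    return source.rstrip("/") + "/" + path


print(fs_path("s3://my-bucket", "data/image.jpg"))
print(fs_path("file:///home/user/project", "out/result.csv"))
print(fs_path("", "dir/file.txt"))
```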
get_local_path
get_local_path() -> str | None
Return the path to the file in the local cache.
Returns None if the file is not cached. Raises an exception if the cache is not set up.
get_uri
get_uri() -> str
Deprecated: Return a URI-like string for this file.
open
open(
mode: str = "rb",
*,
client_config: dict[str, Any] | None = None,
**open_kwargs
) -> Iterator[Any]
Open the file and return a file-like object.
Supports both read ("rb", "r") and write modes (e.g. "wb", "w", "ab"). When opened in a write mode, metadata is refreshed after closing.
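A minimal usage sketch, assuming open() is used as a context manager that yields the file-like object (the Iterator return type suggests this, but it is an assumption here). The helper read_head is hypothetical and accepts any object with a compatible open() method.

```python
def read_head(file, n: int = 16) -> bytes:
    """Read the first n bytes of `file` (hypothetical helper)."""
    with file.open("rb") as f:   # handle is closed when the block exits
        return f.read(n)
```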
read
read(length: int = -1)
read_bytes
read_bytes(length: int = -1)
read_text
Return file contents decoded as text.
Parameters:
- **open_kwargs (Any): Extra keyword arguments forwarded to open(mode="r", ...) (e.g. encoding="utf-8", errors="ignore").
rebase
Rebase the file's URI from one base directory to another.
Parameters:
- old_base (str): Base directory to remove from the file's URI.
- new_base (str): New base directory to prepend.
- suffix (str, default: ''): Optional suffix to add before the file extension.
- extension (str, default: ''): Optional new file extension (without the dot).
Returns:
- str: Rebased URI with the new base directory.
Raises:
- ValueError: If old_base is not found in the file's URI.
Examples:
file = File(source="s3://bucket", path="data/2025-05-27/file.wav")
file.rebase("s3://bucket/data", "s3://output-bucket/processed",
extension="mp3")
# 's3://output-bucket/processed/2025-05-27/file.mp3'
file.rebase("data/audio", "/local/output", suffix="_ch1",
extension="npy")
# '/local/output/file_ch1.npy'
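The transformation can be approximated with plain string handling. The function rebase_uri below is a simplified, hypothetical sketch that works on a URI string rather than a File, and the real method may treat schemes and edge cases differently. The second call uses an assumed input URI that contains data/audio.

```python
import posixpath


def rebase_uri(uri: str, old_base: str, new_base: str,
               suffix: str = "", extension: str = "") -> str:
    """Swap old_base for new_base in uri (simplified, hypothetical sketch)."""
    idx = uri.find(old_base)
    if idx == -1:
        raise ValueError(f"{old_base!r} not found in {uri!r}")
    # Keep only what follows old_base, without the leading slash
    relative = uri[idx + len(old_base):].lstrip("/")
    stem, ext = posixpath.splitext(relative)
    if extension:
        ext = "." + extension
    return new_base.rstrip("/") + "/" + stem + suffix + ext


print(rebase_uri("s3://bucket/data/2025-05-27/file.wav",
                 "s3://bucket/data", "s3://output-bucket/processed",
                 extension="mp3"))
# s3://output-bucket/processed/2025-05-27/file.mp3

print(rebase_uri("s3://bucket/data/audio/file.wav",
                 "data/audio", "/local/output",
                 suffix="_ch1", extension="npy"))
# /local/output/file_ch1.npy
```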
resolve
Resolve a File object by checking its existence and updating its metadata.
Returns:
- File (Self): The resolved File object with updated metadata.
save
Write file contents to destination.
Parameters:
- destination (str): Target path or URI. Accepts a local OS path, a cloud URI (s3://…, gs://…, az://…), or an unencoded file:// URI as produced by datachain.fs.utils.path_to_fsspec_uri. Do not pass Path.as_uri() output: it produces RFC percent-encoded URIs (e.g. file:///my%20dir) which fsspec does not decode, causing FileNotFoundError for paths containing spaces, #, or %.
- client_config (dict | None, default: None): Optional extra kwargs forwarded to the storage client (e.g. credentials, endpoint URL).
Example
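A minimal sketch of the encoding pitfall described for destination, using only the standard library and assuming a POSIX system. It does not call datachain: the hand-built unencoded string merely stands in for what path_to_fsspec_uri is described as producing, and its exact output is an assumption.

```python
from pathlib import Path

# A path containing a space (the file does not need to exist)
path = Path("/tmp/my dir/out.txt")

encoded = path.as_uri()            # percent-encoded, unsuitable as a destination
unencoded = "file://" + str(path)  # unencoded form, written by hand here

print(encoded)    # file:///tmp/my%20dir/out.txt
print(unencoded)  # file:///tmp/my dir/out.txt
```

Passing the plain OS path (str(path)) as the destination avoids the question altogether.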
upload
classmethod
Upload bytes to a storage path and return a File pointing to it.
Parameters:
- data (bytes): The raw bytes to upload.
- path (str | PathLike[str]): Destination path (local or remote, e.g. s3://bucket/file.txt).
- catalog (Catalog | None, default: None): Optional catalog instance. If None, the current session catalog is used.
Returns:
- File (Self): A new File object with metadata populated from the upload.
Note
To write data as a stream, use File.at with open instead:
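A hedged sketch of that pattern, assuming File.at(uri) returns an object whose open("wb") works as a context manager yielding a writable handle. The helper write_stream is hypothetical; the class is passed in as file_cls (datachain's File in practice) so the sketch does not depend on any import.

```python
from typing import Iterable


def write_stream(file_cls, uri: str, chunks: Iterable[bytes]):
    """Write chunks to uri via file_cls.at(uri).open("wb") (hypothetical helper)."""
    file = file_cls.at(uri)
    with file.open("wb") as f:
        for chunk in chunks:   # data is written piece by piece
            f.write(chunk)
    return file
```

With datachain installed, a call might look like write_stream(File, "s3://bucket/file.txt", chunks), where chunks is any iterable of bytes.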
FileError
TarVFile
Bases: VFile
Virtual file model for files extracted from tar archives.
open
classmethod
Stream file from tar archive based on location in archive.