Method: Polars::IO#scan_csv

Defined in:
lib/polars/io/csv.rb

#scan_csv(source, has_header: true, separator: ",", comment_prefix: nil, quote_char: '"', skip_rows: 0, skip_lines: 0, schema: nil, schema_overrides: nil, null_values: nil, missing_utf8_is_empty_string: false, ignore_errors: false, cache: true, with_column_names: nil, infer_schema: true, infer_schema_length: N_INFER_DEFAULT, n_rows: nil, encoding: "utf8", low_memory: false, rechunk: false, skip_rows_after_header: 0, row_index_name: nil, row_index_offset: 0, try_parse_dates: false, eol_char: "\n", new_columns: nil, raise_if_empty: true, truncate_ragged_lines: false, decimal_comma: false, glob: true, storage_options: nil, credential_provider: "auto", retries: nil, file_cache_ttl: nil, include_file_paths: nil) ⇒ LazyFrame

Lazily read from a CSV file or multiple files via glob patterns.

This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.

Parameters:

  • source (Object)

    Path to a file, or a glob pattern matching multiple files.

  • has_header (Boolean) (defaults to: true)

    Indicate whether the first row of the dataset is a header. If set to false, column names are autogenerated in the format column_x, where x enumerates the columns in the dataset starting at 1.
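
    As a small illustration of the naming scheme described above, here is a plain Ruby sketch. It is not the library's own code, and the helper name autogenerated_names is hypothetical.

    ```ruby
    # Hypothetical helper: builds the column_x names described above
    # for a headerless file with `width` columns, counting from 1.
    def autogenerated_names(width)
      (1..width).map { |i| "column_#{i}" }
    end

    names = autogenerated_names(3)
    puts names.inspect
    ```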

  • separator (String) (defaults to: ",")

    Single byte character to use as separator in the file.
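
    A rough way to check the single-byte requirement before calling scan_csv is sketched below in plain Ruby. The helper name single_byte? is hypothetical and is not the library's own validation code.

    ```ruby
    # Hypothetical helper: true when the string encodes to exactly one byte.
    def single_byte?(value)
      value.bytesize == 1
    end

    checks = {
      "," => single_byte?(","),            # comma
      "\t" => single_byte?("\t"),          # tab
      "\u2192" => single_byte?("\u2192"),  # multi-byte arrow character
      "||" => single_byte?("||")           # two characters
    }
    puts checks.inspect
    ```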

  • comment_prefix (String) (defaults to: nil)

    A string used to indicate the start of a comment line. Comment lines are skipped during parsing. Common examples of comment prefixes are # and //.

  • quote_char (String) (defaults to: '"')

    Single byte character used for CSV quoting. Set to nil to turn off special handling and escaping of quotes.

  • skip_rows (Integer) (defaults to: 0)

    Start reading after skip_rows rows. The header will be parsed at this offset.

  • skip_lines (Integer) (defaults to: 0)

    Start reading after skip_lines lines. The header will be parsed at this offset. Note that CSV escaping will not be respected when skipping lines. If you want to skip valid CSV rows, use skip_rows.
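
    The caveat about CSV escaping can be seen with a plain Ruby sketch that skips physical lines in a small, made-up CSV string. This only illustrates line-based skipping in general; it makes no claim about how Polars implements skip_lines internally.

    ```ruby
    # A CSV document whose second record contains a quoted, embedded newline.
    text = "id,note\n1,\"first line\nsecond line\"\n2,plain\n"

    # Line-based skipping splits on the end-of-line character only,
    # without looking at quotes.
    physical_lines = text.lines(chomp: true)

    # Dropping two physical lines lands in the middle of the quoted field.
    remaining = physical_lines.drop(2)
    puts remaining.inspect
    ```

    The first remaining line begins inside the quoted field, which is why skip_rows is the safer choice when the skipped region contains valid CSV rows.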

  • schema (Object) (defaults to: nil)

    Provide the schema. This means that Polars does not do schema inference. This argument expects the complete schema, whereas schema_overrides can be used to partially overwrite a schema. Note that the order of the columns in the provided schema must match the order of the columns in the CSV being read.

  • schema_overrides (Object) (defaults to: nil)

    Overwrite dtypes for specific or all columns during schema inference.

  • null_values (Object) (defaults to: nil)

    Values to interpret as null values. You can provide a:

    • String: All values equal to this string will be null.
    • Array: All values equal to any string in this array will be null.
    • Hash: A hash that maps column name to a null value string.
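
    The three accepted shapes can be read as the following matching rule. This is a plain Ruby sketch under that reading, with a hypothetical helper null_value? that is not part of Polars.

    ```ruby
    # Hypothetical helper: decides whether `value` in `column` counts as
    # null under the three accepted shapes of `null_values`.
    def null_value?(value, column, null_values)
      case null_values
      when String then value == null_values
      when Array  then null_values.include?(value)
      when Hash   then null_values[column] == value
      else false
      end
    end

    puts null_value?("NA", "a", "NA")              # String form
    puts null_value?("-", "a", ["NA", "-"])        # Array form
    puts null_value?("NA", "b", { "a" => "NA" })   # Hash form, other column
    ```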

  • missing_utf8_is_empty_string (Boolean) (defaults to: false)

    By default, a missing value is considered to be null. If you would prefer missing utf8 values to be treated as the empty string, set this parameter to true.

  • ignore_errors (Boolean) (defaults to: false)

    Try to keep reading lines if some lines yield errors. Before enabling this, try infer_schema_length: 0 to read all columns as :str and check which values might cause an issue.

  • cache (Boolean) (defaults to: true)

    Cache the result after reading.

  • with_column_names (Object) (defaults to: nil)

    Apply a function over the column names. This can be used to update a schema just in time, thus before scanning.

  • infer_schema (Boolean) (defaults to: true)

    When true, the schema is inferred from the data using the first infer_schema_length rows. When false, the schema is not inferred and will be Polars::String if not specified in schema or schema_overrides.

  • infer_schema_length (Integer) (defaults to: N_INFER_DEFAULT)

    Maximum number of lines to read to infer schema. If set to 0, all columns will be read as :str. If set to nil, a full table scan will be done (slow).

  • n_rows (Integer) (defaults to: nil)

    Stop reading from CSV file after reading n_rows.

  • encoding ("utf8", "utf8-lossy") (defaults to: "utf8")

    Lossy means that invalid utf8 values are replaced with characters.

  • low_memory (Boolean) (defaults to: false)

    Reduce memory usage at the expense of performance.

  • rechunk (Boolean) (defaults to: false)

    Reallocate to contiguous memory when all chunks/files are parsed.

  • skip_rows_after_header (Integer) (defaults to: 0)

    Skip this number of rows after the header is parsed.

  • row_index_name (String) (defaults to: nil)

    If not nil, this will insert a row index column with the given name into the LazyFrame.

  • row_index_offset (Integer) (defaults to: 0)

    Offset at which to start the row index column (only used if row_index_name is set).

  • try_parse_dates (Boolean) (defaults to: false)

    Try to automatically parse dates. If this does not succeed, the column remains of data type :str.

  • eol_char (String) (defaults to: "\n")

    Single byte end of line character.

  • new_columns (Array) (defaults to: nil)

    Provide an explicit list of string column names to use (for example, when scanning a headerless CSV file). If the given list is shorter than the width of the DataFrame the remaining columns will have their original name.
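
    The renaming rule above matches the lambda that scan_csv builds from new_columns in its source. The same logic is shown here as a standalone plain Ruby sketch; rename_leading is a hypothetical name used only for illustration.

    ```ruby
    # Supplied names replace the leading columns; any remaining columns
    # keep their original names.
    def rename_leading(cols, new_columns)
      if cols.length > new_columns.length
        new_columns + cols[new_columns.length..]
      else
        new_columns
      end
    end

    original = ["column_1", "column_2", "column_3"]
    renamed = rename_leading(original, ["id", "name"])
    puts renamed.inspect
    ```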

  • raise_if_empty (Boolean) (defaults to: true)

    When there is no data in the source, NoDataError is raised. If this parameter is set to false, an empty LazyFrame (with no columns) is returned instead.

  • truncate_ragged_lines (Boolean) (defaults to: false)

    Truncate lines that are longer than the schema.

  • decimal_comma (Boolean) (defaults to: false)

    Parse floats using a comma as the decimal separator instead of a period.

  • glob (Boolean) (defaults to: true)

    Expand path given via globbing rules.

  • storage_options (Hash) (defaults to: nil)

    Options that indicate how to connect to a cloud provider.

    The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

    • aws
    • gcp
    • azure
    • Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.

    If storage_options is not provided, Polars will try to infer the information from environment variables.

  • credential_provider (Object) (defaults to: "auto")

    Provide a function that can be called to provide cloud storage credentials. The function is expected to return a hash of credential keys along with an optional credential expiry time.

  • retries (Integer) (defaults to: nil)

    Number of retries if accessing a cloud instance fails. Deprecated in 0.25.0; specify 'max_retries' in storage_options instead.

  • file_cache_ttl (Integer) (defaults to: nil)

    Amount of time to keep downloaded cloud files since their last access time, in seconds. Uses the POLARS_FILE_CACHE_TTL environment variable (which defaults to 1 hour) if not given. Deprecated in 0.25.0; specify 'file_cache_ttl' in storage_options instead.

  • include_file_paths (String) (defaults to: nil)

    Include the path of the source file(s) as a column with this name.

Returns:

  • (LazyFrame)

# File 'lib/polars/io/csv.rb', line 653

def scan_csv(
  source,
  has_header: true,
  separator: ",",
  comment_prefix: nil,
  quote_char: '"',
  skip_rows: 0,
  skip_lines: 0,
  schema: nil,
  schema_overrides: nil,
  null_values: nil,
  missing_utf8_is_empty_string: false,
  ignore_errors: false,
  cache: true,
  with_column_names: nil,
  infer_schema: true,
  infer_schema_length: N_INFER_DEFAULT,
  n_rows: nil,
  encoding: "utf8",
  low_memory: false,
  rechunk: false,
  skip_rows_after_header: 0,
  row_index_name: nil,
  row_index_offset: 0,
  try_parse_dates: false,
  eol_char: "\n",
  new_columns: nil,
  raise_if_empty: true,
  truncate_ragged_lines: false,
  decimal_comma: false,
  glob: true,
  storage_options: nil,
  credential_provider: "auto",
  retries: nil,
  file_cache_ttl: nil,
  include_file_paths: nil
)
  if !new_columns&.any? && schema_overrides.is_a?(::Array)
    msg = "expected 'schema_overrides' hash, found #{schema_overrides.inspect}"
    raise TypeError, msg
  elsif new_columns&.any?
    if with_column_names
      msg = "cannot set both `with_column_names` and `new_columns`; mutually exclusive"
      raise ArgumentError, msg
    end
    if schema_overrides && schema_overrides.is_a?(::Array)
      schema_overrides = new_columns.zip(schema_overrides).to_h
    end

    # wrap new column names as a callable
    with_column_names = lambda do |cols|
      if cols.length > new_columns.length
        new_columns + cols[new_columns.length..]
      else
        new_columns
      end
    end
  end

  Utils._check_arg_is_1byte("separator", separator, false)
  Utils._check_arg_is_1byte("quote_char", quote_char, true)

  if Utils.pathlike?(source)
    source = Utils.normalize_filepath(source)
  end

  if !infer_schema
    infer_schema_length = 0
  end

  if !retries.nil?
    msg = "the `retries` parameter was deprecated in 0.25.0; specify 'max_retries' in `storage_options` instead."
    Utils.issue_deprecation_warning(msg)
    storage_options = storage_options || {}
    storage_options["max_retries"] = retries
  end

  if !file_cache_ttl.nil?
    msg = "the `file_cache_ttl` parameter was deprecated in 0.25.0; specify 'file_cache_ttl' in `storage_options` instead."
    Utils.issue_deprecation_warning(msg)
    storage_options = storage_options || {}
    storage_options["file_cache_ttl"] = file_cache_ttl
  end

  credential_provider_builder = _init_credential_provider_builder(
    credential_provider, source, storage_options, "scan_csv"
  )

  _scan_csv_impl(
    source,
    has_header: has_header,
    separator: separator,
    comment_prefix: comment_prefix,
    quote_char: quote_char,
    skip_rows: skip_rows,
    skip_lines: skip_lines,
    schema_overrides: schema_overrides,
    schema: schema,
    null_values: null_values,
    ignore_errors: ignore_errors,
    cache: cache,
    with_column_names: with_column_names,
    infer_schema_length: infer_schema_length,
    n_rows: n_rows,
    low_memory: low_memory,
    rechunk: rechunk,
    skip_rows_after_header: skip_rows_after_header,
    encoding: encoding,
    row_index_name: row_index_name,
    row_index_offset: row_index_offset,
    try_parse_dates: try_parse_dates,
    eol_char: eol_char,
    raise_if_empty: raise_if_empty,
    truncate_ragged_lines: truncate_ragged_lines,
    decimal_comma: decimal_comma,
    glob: glob,
    storage_options: storage_options,
    credential_provider: credential_provider_builder,
    include_file_paths: include_file_paths
  )
end
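
The deprecation branches in the listing above fold retries and file_cache_ttl into storage_options. A minimal standalone sketch of that folding follows. The name fold_deprecated_options is hypothetical, the deprecation warning is omitted, and unlike the listing this sketch copies the hash instead of modifying the caller's.

```ruby
# Hypothetical helper: merges the deprecated keyword values into a copy
# of storage_options under the keys used in the listing above.
def fold_deprecated_options(storage_options: nil, retries: nil, file_cache_ttl: nil)
  # Nothing deprecated was passed: leave storage_options as given.
  return storage_options if retries.nil? && file_cache_ttl.nil?

  options = (storage_options || {}).dup
  options["max_retries"] = retries unless retries.nil?
  options["file_cache_ttl"] = file_cache_ttl unless file_cache_ttl.nil?
  options
end

folded = fold_deprecated_options(retries: 3, file_cache_ttl: 600)
puts folded.inspect
```

Following the deprecation messages, new code would pass these keys in storage_options directly rather than using the retries and file_cache_ttl keywords.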