Consider using UTF-8 when encoding is unspecified #200

@danstoner

For example, attempting to ingest:

https://unhcollection.unh.edu/database/content/dwca/UNHC-UNHC_DwC-A.zip

The published Darwin Core Archive includes a meta.xml which has a blank encoding value:

encoding=""

The rest of that line looks like:

<core dateFormat="YYYY-MM-DD" encoding="" fieldsTerminatedBy="," linesTerminatedBy="\n" fieldsEnclosedBy="&quot;" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">

The encoding value tells consumers of the occurrence file how to decode the file properly.
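To make the blank-value case concrete, here is a minimal sketch of how a consumer might read that attribute. The function name `read_core_encoding` is hypothetical and is not taken from the ingestion code; the sketch only assumes that meta.xml contains a `core` element like the one quoted above.

```python
import xml.etree.ElementTree as ET


def read_core_encoding(meta_xml_text):
    """Return the <core> encoding attribute, or None when absent or blank.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(meta_xml_text)
    for elem in root.iter():
        # Strip any XML namespace so the lookup works with or without one.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "core":
            value = elem.get("encoding", "").strip()
            return value or None
    return None
```

With this shape, `encoding=""` and a missing `encoding` attribute both come back as `None`, so the caller has a single "unspecified" case to decide on.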

The data provider has been unable to resolve the situation in over a year.

https://redmine.idigbio.org/issues/3002

Consider whether it is worth assuming UTF-8 in this situation so the data can be ingested, or whether it still makes sense to hard fail, since there is a chance of "bad things" if the encoding turns out to be mismatched.
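One possible middle ground, sketched below, is to assume UTF-8 only when the declared encoding is blank and to decode strictly, so that bytes which are not valid UTF-8 still cause a hard failure. The function name `decode_occurrence_bytes` and its parameters are hypothetical and do not describe the current ingestion code.

```python
def decode_occurrence_bytes(raw, declared_encoding=None, fallback="utf-8"):
    """Decode raw bytes, assuming `fallback` when no encoding is declared.

    Hypothetical helper for illustration only. Decoding is strict, so
    bytes that are invalid in the chosen encoding raise UnicodeDecodeError
    instead of being silently replaced.
    """
    encoding = (declared_encoding or "").strip() or fallback
    text = raw.decode(encoding, errors="strict")
    # Return the encoding actually used so the caller can log the fallback.
    return text, encoding
```

Strict decoding catches many mismatches, for example Latin-1 bytes with accented characters, but not all of them: a file in another encoding can occasionally happen to be valid UTF-8, so logging that the fallback was applied may still be worthwhile.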
