Wider numeric field format support#44
Conversation
Use regular expression operations on numeric fields to discard all characters except numbers and decimal separator, and identify negative values by a leading or trailing '-' or case insensitive 'DR'.
pdf_statement_reader/parse.py
Outdated
| def clean_numeric(df, config): | ||
| numeric_cols = [config["columns"][col] for col in config["cleaning"]["numeric"]] | ||
|
|
||
| def format_negatives(s): |
There was a problem hiding this comment.
I know I haven't been good yet about writing tests for the project, but can you please add some tests to show that the format_currency_number method actually does what it's supposed to.
I have a natural aversion to regex, mostly because it generally results in the opposite of what python generally gives us - clean human readable code that's easy to easily visually grep what's going on. Python's built in string methods usually do a better job at this. In cases like this however with multiple complex rules, I'll concede that using regex may result in cleaner code. I'd only include it if there were tests against it though. Just reading it now, I'm not certain what all the side effects and edge cases might be.
I'll update with a sample test on the original format_negatives method that you can use as a basis.
There was a problem hiding this comment.
I moved the format_negatives method outside of the clean_numeric method so that it could be tested but that's now caused a merge conflict with your branch. Should be easy enough to fix though.
Looking at that old code I wrote in parse.py there's certainly a lot to be desired in terms of it being robust to the different content that might come out of a pdf statement file
Use regular expression operations on numeric fields to discard all characters except numbers and decimal separator, and identify negative values by a leading or trailing '-' or case insensitive 'DR'. See #33.