-
Notifications
You must be signed in to change notification settings - Fork 3
Add support for parsing Gregorian dates in standard text formats #160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. WalkthroughAdds Gregorian date parsing and integration: new Lark grammar and autogenerated multilingual month tokens, a module-level parser, transformer, and converter with parse(), combined-parser registration, tests, a codegen script, and a hatch env entry to regenerate month grammar. Changes
Sequence Diagram(s)sequenceDiagram
participant User as User
participant Converter as GregorianDateConverter
participant Parser as gregorian_parser
participant Transformer as GregorianDateTransformer
participant UndateObj as Undate
User->>Converter: parse("26 Ugushyingo 2022")
Converter->>Parser: parse(value)
Parser->>Parser: tokenize & match gregorian_date
Parser-->>Converter: parse_tree
Converter->>Transformer: transform(parse_tree)
Transformer->>Transformer: extract year, month, day
Transformer-->>Converter: UndateObj
Converter-->>User: return Undate (label set)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #160 +/- ##
===========================================
+ Coverage 98.96% 99.00% +0.04%
===========================================
Files 40 45 +5
Lines 2121 2216 +95
===========================================
+ Hits 2099 2194 +95
Misses 22 22 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/undate/converters/combined.py (1)
1-5:⚠️ Potential issue | 🟡 MinorModule docstring not updated to include Gregorian.
The module-level docstring on line 2 still says "Supports EDTF, Hebrew, and Hijri" while the class docstring on line 50 was correctly updated to include Gregorian.
Proposed fix
""" -**Experimental** combined parser. Supports EDTF, Hebrew, and Hijri +**Experimental** combined parser. Supports EDTF, Gregorian, Hebrew, and Hijri where dates are unambiguous. (Year-only dates are parsed as EDTF in Gregorian calendar.) """
🤖 Fix all issues with AI agents
In `@scripts/generate_gregorian_grammar.py`:
- Around line 37-46: The grammar currently preserves Babel's exact month casing
(from get_month_names) causing case-sensitive parsing; normalize month names to
a consistent case before adding to all_month_names and emitting the grammar:
when iterating months in the nested loops, apply month_name =
month_name.strip(".").lower() (or .upper() if you prefer) and deduplicate using
that normalized value so the generated terminals are lowercase and the parser
accepts case-insensitive inputs; update any downstream places that consume
all_month_names to expect the normalized form.
In `@src/undate/converters/calendars/gregorian/converter.py`:
- Line 103: The inline comment in
src/undate/converters/calendars/gregorian/converter.py incorrectly mentions
"Hebrew" (copy-paste error); update the comment near the Gregorian parser
invocation (the parsing block in the converter module where the string is
parsed) to correctly say "Gregorian" (e.g., replace "# parse the string with our
Hebrew date parser" with a comment referencing the Gregorian date parser) so it
accurately reflects the behavior of the parse/convert routine in this converter.
- Around line 110-111: The except clauses in the calendar converters currently
catch only UnexpectedCharacters; update them to catch
lark.exceptions.UnexpectedInput instead (or import UnexpectedInput from
lark.exceptions) so UnexpectedToken and UnexpectedEOF are also handled and
re-raised as ValueError; apply this change in the Gregorian converter
(converter.py) and the same pattern in the hebrew, islamic, edtf, and combined
converter modules, ensuring the corresponding except blocks (the ones that
currently read "except UnexpectedCharacters as err") are replaced and that the
module imports include UnexpectedInput where needed.
In `@src/undate/converters/grammars/combined.lark`:
- Around line 4-7: The global `%ignore PUNCTUATION` directive in combined.lark
removes the comma token needed by the Hebrew and Islamic grammars (they define
`comma: ","` and use it in their date rules), so restore comma token visibility
by removing or narrowing that ignore: either delete `PUNCTUATION`/`%ignore
PUNCTUATION` from combined.lark, or change the approach so commas are not
ignored (e.g., apply punctuation-ignore only inside the Gregorian subgrammar or
adjust the PUNCTUATION pattern to exclude ","), ensuring the `comma` rule in the
Hebrew and Islamic grammar can be lexed and matched.
In `@src/undate/converters/grammars/gregorian_multilang.lark`:
- Around line 4-15: The grammar is case-sensitive so lowercase user input like
"january" won't match; update the generator or parsing flow: modify
scripts/generate_gregorian_grammar.py to emit month literals in a single case
(e.g., all lowercase) OR (simpler) normalize text before parsing by lowercasing
inputs in the converter that invokes the Lark parser for
gregorian_multilang.lark (i.e., call .lower() on the raw date string before
passing it to Lark), and ensure the generated grammar's month alternatives are
also lowercased (regenerate the .lark file) so names line up.
🧹 Nitpick comments (2)
tests/test_converters/test_calendars/test_gregorian/test_gregorian_transformer.py (1)
21-31: Consider adding a test case for year+month order (e.g.,"1960 Jan").The transformer tests don't cover the year-before-month ordering that the parser tests include (e.g.,
"1900 Feb"in the parser tests). Adding one here would validate the transformer handles all orderings correctly end-to-end.src/undate/converters/grammars/combined.lark (1)
42-43: Consider uncommenting the Gregorian override to avoid year-only ambiguity.The comment acknowledges that year-only parsing should be covered by EDTF, yet the
gregorian__gregorian_dateimport on line 32 brings in the full grammar including theyear-only alternative. A standalone year like"932"would be ambiguous betweenedtf__startandgregorian__gregorian_datein the combined parser. The Hebrew and Islamic grammars have their%overridedirectives active (lines 37, 40) for the same reason.Is there a specific reason this override is commented out? If not, activating it would be consistent with the approach for the other calendars.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@scripts/generate_gregorian_grammar.py`:
- Around line 1-11: The docstring in the script generate_gregorian_grammar.py
contains a typo "regeneate" on the line describing how to run the script; update
that word to "regenerate" in the module docstring so the usage text reads "Run
this script with hatch to regenerate the file::".
🧹 Nitpick comments (1)
DEVELOPER_NOTES.md (1)
94-103: Use a fenced code block for consistency with the rest of the file.The indented code block on Line 101 and the trailing
::on Line 99 (an RST convention) are inconsistent with the fenced (```) style used elsewhere in this Markdown file. Also flagged by markdownlint (MD046).Suggested fix
-library. To regenerate, run the script with hatch (which should -be installed globally):: +library. To regenerate, run the script with hatch (which should +be installed globally): - hatch run codegen:generate - +```sh +hatch run codegen:generate +``` + When the `.lark` file is modified by the script, it must be committed to git.
resolves #127
Uses a code-generation script to create the grammar for month names (full and abbreviated) based on a list of languages. The approach for the hatch codegen script is based on a conversation with claude.ai
Summary by CodeRabbit
New Features
Tests
Documentation
Chores