Source code for yadr.base

"""
base
~~~~

Base classes for the :mod:`yadr` package.
"""
from abc import ABC, abstractmethod
from collections.abc import Callable
from typing import Optional

from yadr.model import CompoundResult, Result, Token, TokenInfo


# Types
ResultMethod = Callable[[str], Result]
StateMethod = Callable[[str], None]


# Utility functions.
def _mutable(value, type_=list):
    """Return an empty mutable type to avoid bugs where you put a
    mutable in the signature.
    """
    if not value:
        value = type_()
    return value


# Base classes.

[docs]
class BaseLexer(ABC):
    """An abstract base class for building lexers.

    :param state_map: A dictionary mapping the state of the lexer
        to the processing method for that state.
    :param symbol_map: A dictionary mapping states of the lexer to
        characters that could occur within the text being lexed.
    :param bracket_states: (Optional.) A dictionary mapping opening
        or delimiting states to a state that collects characters
        within the brackets or delimiters to send to a more
        specific lexer.
    :param bracket_ends: (Optional.) A dictionary mapping bracket
        states to a state that processes characters after the end
        of the bracket state.
    :param result_map: (Optional.) A dictionary mapping states
        to a result transformation method to transform the data
        in the lexed string before storing it in as a token
        value.
    :param no_store: (Optional.) A list of states that should not
        be stored as tokens.
    :param init_state: (Optional.) The initial state of the lexer.
        It defaults to :class:`Token.START`.
    :return: A :class:`yadr.base.BaseLexer` object.
    :rtype: yadr.base.BaseLexer

    :class:`yadr.base.BaseLexer` lexers are state machines used for
    translating a text string into tokens for parsing. It accomplishes
    this by processing the string one character at a time, allowing the
    current state of the lexer to determine whether the character is
    legal and what should be done with it.


    State
    -----
    The current state of the lexer is determined by the value of
    :attr:`yadr.base.BaseLexer.state`. Its value will be a member
    of the enumeration used to define the tokens that exist within
    the language. This state is used to define the rule used to
    process the next character in the string.

    Characters that do not cause the state of the lexer to change
    should be appended to the end of the `buffer` attribute of
    the lexer.


    State Change
    ------------
    When the lexer encounters a character that represents the end of
    the previous token and the start of a new token, the state of
    the lexer changes. The specific details can vary based on the
    current state of the lexer, but by default the following occurs
    when the state is changed:

    *   A new :class:`model.TokenInfo` object is created that contains
        the current state of the lexer and the current value of the
        `buffer` attribute of the lexer.
    *   That :class:`model.TokenInfo` object is appended to the `tokens`
        attribute of the lexer.
    *   The `buffer` of the lexer is cleared.
    *   The `state` of the lexer is changed to the new state.
    *   The :meth:`BaseLexer.process` method is changed to the process
        method for the new state.


    :meth:`BaseLexer.process()` and Processing Methods
    --------------------------------------------------
    The :meth:`BaseLexer.process` method of a :class:`BaseLexer`
    subclass should not be defined. Instead the name should be
    assigned to a "processing" method specific to the current state
    of the lexer. By convention, the names of these method starts
    with an underscore, which is followed by the name of the state in
    lowercase letters. So the processing method for the state::

        Token.GROUP_OPEN

    would be::

        _group_open

    The signature for processing methods are::

        (self, char: str) -> None

    where `char` is the character being processed.

    While specific tokens may require different behavior, in general
    a processing method does two things:

    *   Define a list of states that are allowed to follow the
        current state within the syntax being lexed.
    *   Pass that list and the character to :meth:`BaseLexer._check_char`,
        which handles the actual processing.

    The end result of calling a processing method is usually that
    the characters in the string that make up the symbol for the
    current state are stored in a "TokenInfo" :class:`tuple`, which
    consists of the token representing the state and the characters of
    the symbol. These tokens will then be used by the parser to
    execute the command contained in the string.


    The State Map
    -------------
    To determine the correct processing method to use for a state, the
    lexer needs to have a mapping that defines the method for the state.
    This dictionary is the "state map." The tokens for the state are
    the keys, and the processing method for that state is the value
    for the key. This dictionary is passed into the :class:`BaseLexer`
    as the `state_map` parameter when the lexer is initialized.


    The Symbol Map
    --------------
    A BaseLexer uses a "symbol map" to associate characters in the
    string to a state. The symbol map is a dictionary. The keys are
    the tokens from the enumeration that defines state. The values
    are a list of the strings that are allowed in that state. For
    example, if you have a token named "MULDIV" that is the state for
    multiplication and division operators, the symbol map might look
    like::

        >>> state_map = {
        >>>     Token.MULDIV: ['*', '/'],
        >>> }

    The symbol map is passed to the `symbol_map` parameter when the
    lexer is initialized.


    Bracketing
    ----------
    Instead of running each character through
    :class:`BaseLexer._check_char`, it is possible
    for a processing method to instead "bracket" characters
    until a specific character is reached. For example, characters
    after a quotation mark can be collected as a substring until
    the lexer hits another quotation mark.

    Why do this? The main use for this is to turn the bracketed
    substring into a single token, rather than three tokens: the
    opening bracket/delimiter, the content of the bracket, and the
    closing brack/delimiter.

    To expand on the quotation marks example above, let's characters
    surrounded by quotation marks to belong to a token called
    "QUALIFIER". We have the following enumeration of states and
    a `symbol_map` that defines which characters belong to which
    states::

        >>> from enum import auto, Enum
        >>> class Token(Enum):
        >>>     QUALIFIER = auto()
        >>>     DELIM = auto()
        >>>     QUALIFIER_END = auto()
        >>>
        >>> symbol_map = {
        >>>     Token.QUALIFIER: '',
        >>>     Token.DELIM: '"',
        >>>     QUALIFIER_END: '',
        >>> }

    The string we want to lex is::

        >>> text = '"spam"'

    Without a bracket state, you'd end up with a token list that
    would look like the following, assuming the logic for the
    QUALIFIER state is written to accept alphabetical characters
    as valid for qualifiers::

        >>> (
        >>>     (Token.DELIM, '"'),
        >>>     (Token.QUALIFER, 'spam'),
        >>>     (Token.DELIM, '"'),
        >>> )

    That's probably fine, but the delimiter tokens don't really
    do anything at this point. They were just there to set out
    the qualifier in the string. So, you can have them excluded
    from the token list like the following by using bracketing::

        >>> (
        >>>     (Token.QUALIFER, 'spam'),
        >>> )

    The real power here comes from combining with a result map
    to send the bracketed content of to a different lexer and
    parser, which allows syntaxes to be nested within each other.


    Bracket States
    --------------
    To have a processing method bracket, you need to associate
    the state for the opening bracket or delimiter with a
    processing method that handles the bracketing in a dictionary
    passed to the `bracket_states` parameter when initializing the
    BaseLexer. The `bracket_states` dictionary for the above example
    would look like this:

        >>> bracket_state = {
        >>>     Token.DELIM: Token.QUALIFIER,
        >>> }


    Bracket Ends
    ------------
    Because a bracket state hides the closing bracket or delimiter
    from the lexer, you need a different way to handle the state
    after a bracket state. This is handled by a standard processing
    method. By convention the name of this method is an underscore
    followed by the name of the bracket state followed by an
    underscore and then the word "end". For our example it would be::

        _qualifier_end

    This state needs to have a state token assigned for it. In our
    example that is the `Token.QUALIFIER_END` token.

    This end state then needs to be linked to the bracket state in
    a dictionary that is passed to the `bracket_ends` parameter
    when initializing the lexer. In the example, the `bracket_ends`
    dictionary would look like::

        bracket_ends = {
            Token.QUALIFIER: Token.QUALIFIER_END,
        }


    No Store
    --------
    Some states, like the initial state, bracket end states, and
    white space, shouldn't be stored as tokens. These are defined
    by the "no store" list, which is passed to the `no_store`
    parameter when the lexer is initialized. For the example above,
    the no store list could look like::

        >>> no_store = [
        >>>     Token.QUALIFIER_END,
        >>> ]


    Result Transformations
    ----------------------
    By default, a :class:`BaseLexer` stores the symbols for the token
    as a string in the TokenInfo. This behavior can be changed with a
    "result transformation" method. By convention the name of a
    result transformation starts with an underscore, the letters "tf",
    an underscore, and the name of the state they affect in all lower
    case. So the name of a result transformation method for the
    `Token.NUMBER` state would be::

        _tf_number

    Result transformations have the following signature::

        (self, value:str) -> <type_of_the_transformed_value>

    .. warning:
        The return type of the result transformation method needs
        to be added to the types allowed for TokenInfo. This adds
        complexity that has downstream affects on the parser.

    In the case of something like `Token.NUMBER` the transformation
    can be very simple, coercing a string to an integer. However,
    more complex transformations are possible, such as sending
    bracketed symbols to a different lexer and parser to allow syntax
    nesting.


    Result Map
    ----------
    In order to link the result transformation methods to a state,
    a :class:`BaseLexer` needs a "result map". The result map is a
    dictionary. The keys are the states where the transforms are used.
    The values are the result transformation methods to use for that
    state. For example, the result map for a lexer that transforms
    numbers and qualifiers might look like::

        >>> result_map = {
        >>>     Token.NUMBER: _tf_number,
        >>>     Token.QUALIFIER: _qualifier,
        >>> }

    The result map is passed to the `result_map` parameter when the
    :class:`BaseLexer` is initialized.
    """
    def __init__(
        self, state_map: dict[Token, StateMethod],
        symbol_map: dict[Token, list[str]],
        bracket_states: Optional[dict[Token, Token]] = None,
        bracket_ends: Optional[dict[Token, Token]] = None,
        result_map: Optional[dict[Token, ResultMethod]] = None,
        no_store: Optional[list[Token]] = None,
        init_state: Token = Token.START
    ) -> None:
        """Initialize an instance of :class:`BaseLexer`."""
        # Assign the passed parameters.
        self.state_map = state_map
        self.symbol_map = symbol_map
        self.bracket_states = _mutable(bracket_states, dict)
        self.bracket_ends = _mutable(bracket_ends, dict)
        self.result_map = _mutable(result_map, dict)
        self.no_store = _mutable(no_store)
        self.init_state = init_state

        # Assign internal attributes.
        self.state = init_state
        self.process: StateMethod = self._start
        self.buffer = ''
        self.tokens: list[TokenInfo] = []

    # Public methods.

[docs]
    def lex(self, code: str) -> tuple[TokenInfo, ...]:
        """Lex code into tokens for parsing.

        :param code: A string of code to tranform into tokens.
        :return: A :class:`tuple` object.
        :rtype: tuple
        """
        # Process each character in the code.
        for char in code:
            self.process(char)

        # Reset the lexer after processing the string in case the lexer
        # is reused.
        else:
            self._change_state(self.init_state, '')

        # Return the tokens from the code.
        return tuple(self.tokens)


    # Private operation method.
    def _is_token_start(self, token: Token, char: str) -> bool:
        """Is the given character the start of a new token."""
        valid = {s[0] for s in self.symbol_map[token]}
        return char in valid

    def _is_token_still(self, char: str) -> bool:
        """Is the given character still a part of the current token."""
        index = len(self.buffer)
        tokens = [t for t in self.symbol_map[self.state] if len(t) > index]
        if tokens:
            valid = {s[index] for s in tokens}
            return char in valid
        return False

    def _cannot_follow(self, char: str) -> None:
        """The character is not allowed by the current state."""
        state = self.state.name
        if state == 'WHITESPACE' and self.tokens:
            state = self.tokens[-1][0].name
        elif state == 'WHITESPACE':
            state = 'START'
        if state == 'QUALIFIER_END':
            state = 'QUALIFIER'
        if state == 'NUMBER' and self.buffer == '-':
            state = 'NEGATIVE_SIGN'

        if state == 'START':
            msg = f'Cannot start with {char}.'
        else:
            article = 'a'
            if state[0] in 'AEIOU':
                article = 'an'
            msg = f'{char} cannot follow {article} {state}.'

        raise ValueError(msg)

    def _change_state(self, new_state: Token, char: str) -> None:
        """Terminate the previous token and start a new one."""
        # Terminate and store the old token.
        if self.state not in self.no_store:
            value: Result = self.buffer
            if self.state in self.result_map:
                transform = self.result_map[self.state]
                value = transform(value)
            token_info = (self.state, value)
            self.tokens.append(token_info)

        # Set new state.
        self.buffer = char
        self.state = new_state
        self.process = self.state_map[new_state]

    def _check_char(self, char: str, can_follow: list) -> None:
        """Determine how to process a character."""
        new_state: Optional[Token] = None

        # If the character doesn't change the state, add it to the
        # buffer and stop processing.
        if self._is_token_still(char):
            self.buffer += char
            return None

        # Check to see if the character starts a token that is allowed
        # to follow the current token. Stop looking once you find one.
        for token in can_follow:
            if self._is_token_start(token, char):
                new_state = token
                break

        # If not, throw an exception. Since whitespace isn't a token in
        # YADN, an exception saying a character can't follow WHITESPACE
        # isn't useful. Therefore handle that case by looking at the
        # last stored token.
        else:
            self._cannot_follow(char)

        # Some tokens start a state that doesn't match the token.
        if new_state in self.bracket_states:
            new_state = self.bracket_states[new_state]

        # Catch an attempt to end a number when the only character is
        # negative sign.
        if new_state and self.state == Token.NUMBER and self.buffer == '-':
            self._cannot_follow(char)

        # If the state changed, change the state.
        if new_state:
            self._change_state(new_state, char)

    # Lexing rules.
    @abstractmethod
    def _start(self, char: str) -> None:
        """An abstract method for the processing method used for the
        initial state of the lexer.

        :param char: The character currently being lexed.
        :return: None
        :rtype: NoneType
        """
        # The tokens that are allowed to follow the current state.
        can_follow: list[Token] = []

        # Check to see if the current character causes the lexer
        # to change state.
        self._check_char(char, can_follow)

    def _whitespace(self, char: str) -> None:
        """Lex white space."""
        if char.isspace():
            return None
        prev_state = self.init_state
        if self.tokens:
            prev_state = self.tokens[-1][0]
        if prev_state in self.bracket_ends:
            prev_state = self.bracket_ends[prev_state]
        process = self.state_map[prev_state]
        process(char)