Lexing and Parsing

cloutiy
Posts: 13
Joined: Wed Jul 20, 2016 5:07 pm

Lexing and Parsing

Post by cloutiy »

Hello,

I would like to know of your experience lexing and parsing using eC:
- What kinds of projects have you used lexing/parsing for (creating your own language? a DSL? a markup?)

- What is your favorite resource on the topic?
- What is your preferred method of doing it?
- Are there features (or libraries) in eC that simplify the process?
- Do you have unconventional ideas on how it could be done more easily?

Just throwing some stuff out there, feel free to share your experience!

yves
jerome
Site Admin
Posts: 608
Joined: Sat Jan 16, 2010 11:16 pm

Re: Lexing and Parsing

Post by jerome »

Hi Yves,

This is another one of those times when the forum didn't notify me of a new message! Sorry to take so long to reply...

When I started writing the compiler for eC, I went with what I believed was the sensible approach and used Bison and Flex. Years later, I strongly regret this decision, though the current eC compiler is unfortunately still using them. My experience with those tools is that they are awful and completely unnecessary. In particular, things start to get very messy as soon as you want to add error tolerance to the parser, e.g. to parse incomplete code the user is still typing, for auto-completion or for reporting more meaningful syntax errors.

The way to go, which I think is what most modern serious parsers use these days, is Recursive Descent. I think the Wikipedia page on it, along with its links, would be a great place to start. For my part, I started writing my own RD parser for eC without really reading anything about it, because the concept is very straightforward and intuitive, and matches how you'd naturally think about parsing. It offers a lot of advantages over a generated, table-driven parser like Bison's, one of them being that it is so much easier to debug and step through the parsing. The idea is simply that you write a 'parse' function for each construct, and the function for each construct invokes the 'parse' functions for the constructs it is made of (see the sketch below). It all comes together very beautifully. Error tolerance is also very easily achieved.
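
To give a concrete flavor of this, here is a minimal sketch of the idea (my own illustration in plain C rather than actual eC compiler code, with made-up names) for simple arithmetic expressions; note how each grammar rule becomes exactly one function:

#include <stdio.h>
#include <ctype.h>

/* Grammar: expr   := term (('+'|'-') term)*
            term   := factor (('*'|'/') factor)*
            factor := number | '(' expr ')'          */

static const char *src;   /* cursor into the input */

static void skipSpace(void) { while (isspace((unsigned char)*src)) src++; }

static int parseExpr(void);   /* forward declaration: factor recurses into expr */

static int parseFactor(void)
{
   int value = 0;
   skipSpace();
   if (*src == '(')
   {
      src++;   /* consume '(' */
      value = parseExpr();
      skipSpace();
      if (*src == ')') src++;   /* tolerate a missing ')' rather than aborting */
   }
   else
      while (isdigit((unsigned char)*src))
         value = value * 10 + (*src++ - '0');
   return value;
}

static int parseTerm(void)
{
   int value = parseFactor();
   for (skipSpace(); *src == '*' || *src == '/'; skipSpace())
   {
      char op = *src++;
      int rhs = parseFactor();
      value = op == '*' ? value * rhs : value / rhs;
   }
   return value;
}

static int parseExpr(void)
{
   int value = parseTerm();
   for (skipSpace(); *src == '+' || *src == '-'; skipSpace())
   {
      char op = *src++;
      int rhs = parseTerm();
      value = op == '+' ? value + rhs : value - rhs;
   }
   return value;
}

int main(void)
{
   src = "2 * (3 + 4) - 5";
   printf("%d\n", parseExpr());   /* prints 9 */
   return 0;
}

The tolerated missing ')' is a trivial example of the error tolerance I mentioned: the parser just keeps going instead of bailing out, which is painful to retrofit into a Bison grammar.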

You can take a look at our new RD parser (still incomplete and a work in progress) on the 'ec2' branch of the SDK: https://github.com/ecere/ecere-sdk/tree ... ler/libec2

I find this code extremely elegant and a pleasure to work with, and I can't wait to have the time to complete it and migrate everything to this new RD parser. As for lexing, the concept is simple enough that I would encourage writing your own lexer as well (as opposed to using Flex). For the moment, libec2 is still hooked onto the Flex code ( https://github.com/ecere/ecere-sdk/blob ... /lexing.ec ), but I have started writing our own lexing code, which we still need to integrate into libec2.
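
To show what hand-rolled lexing can look like, here is a rough sketch (again just my own illustration in C, not the actual libec2 lexing code; the token kinds and names are made up):

#include <stdio.h>
#include <ctype.h>

/* Sketch of a hand-written lexer: each call returns the next token
   and advances the cursor past it. */
typedef enum { tkEOF, tkNumber, tkIdentifier, tkPunctuation } TokenKind;

typedef struct { TokenKind kind; const char *start; int length; } Token;

Token nextToken(const char **cursor)
{
   const char *p = *cursor;
   Token t;
   while (isspace((unsigned char)*p)) p++;   /* skip whitespace */
   t.start = p;
   if (!*p)
      t.kind = tkEOF;
   else if (isdigit((unsigned char)*p))
   {
      while (isdigit((unsigned char)*p)) p++;
      t.kind = tkNumber;
   }
   else if (isalpha((unsigned char)*p) || *p == '_')
   {
      while (isalnum((unsigned char)*p) || *p == '_') p++;
      t.kind = tkIdentifier;
   }
   else
   {
      p++;   /* anything else: single-character punctuation */
      t.kind = tkPunctuation;
   }
   t.length = (int)(p - t.start);
   *cursor = p;
   return t;
}

int main(void)
{
   const char *input = "count = count + 42;";
   Token t;
   while ((t = nextToken(&input)).kind != tkEOF)
      printf("kind %d: %.*s\n", t.kind, t.length, t.start);
   return 0;
}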

Do you have a particular project in mind that requires lexing/parsing? I recommend you study this eC2 RD parser code; I think it's a great starting point. If you don't have any particular project in mind and are just looking to learn about parsing, we would welcome contributions on libec2 to complete it and migrate the whole compiler to it! ( http://ec-lang.org/ideas/compiler/#rdparser ).

Best regards,

-Jerome
cloutiy
Posts: 13
Joined: Wed Jul 20, 2016 5:07 pm

Re: Lexing and Parsing

Post by cloutiy »

Jerome, thanks for sharing your experience and some advice.

There was one project which I had done in Perl:

https://github.com/cloutiy/tml

It's a typesetting language that generates GROFF code, meant to make groff a little easier to work with for non-technical people.

It was kind of a hack done with regular expressions, and it only generates groff. Ideally, I'd like to scan the source files to create an AST, which could then generate different targets: LaTeX, groff, HTML, EPUB, etc., along the lines of the sketch below.
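
Something like this, perhaps (a purely hypothetical sketch in C; the node kinds, target names and groff/LaTeX snippets are just for illustration, not tml's actual design):

#include <stdio.h>

/* One AST, several backends: the tree is built once, then walked
   per target. A flat sibling list keeps the sketch short. */
typedef enum { nodeParagraph, nodeBold } NodeKind;
typedef enum { targetGroff, targetLaTeX, targetHTML } Target;

typedef struct Node { NodeKind kind; const char *text; struct Node *next; } Node;

void emit(Node *n, Target target)
{
   for (; n; n = n->next)
      switch (n->kind)
      {
         case nodeParagraph:
            if (target == targetGroff)      printf(".PP\n%s\n", n->text);
            else if (target == targetLaTeX) printf("%s\n\n", n->text);
            else                            printf("<p>%s</p>\n", n->text);
            break;
         case nodeBold:
            if (target == targetGroff)      printf("\\fB%s\\fP\n", n->text);
            else if (target == targetLaTeX) printf("\\textbf{%s}\n", n->text);
            else                            printf("<b>%s</b>\n", n->text);
            break;
      }
}

int main(void)
{
   Node bold = { nodeBold, "important", 0 };
   Node para = { nodeParagraph, "Hello from the AST.", &bold };
   emit(&para, targetHTML);   /* the same tree could emit groff or LaTeX instead */
   return 0;
}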

Regarding parsing techniques, one that I found particularly interesting uses the following approach:

1. Tokens are represented by a single char.
2. The input is scanned.
3. When a token is found, the char representing that token is appended to a char array.
4. At the end of scanning, what you end up with is a char array (a string).

The interesting bit here is that this string built up during scanning can then be matched against regular expressions representing the grammar. In short:

1. The scanner consumes tokens and generates a string of single-char token codes.
2. The parser consumes the string generated by the scanner and compares it against the expected grammar patterns (regular expressions over those same single chars, each char representing a token type).
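
Here is a tiny reconstruction of that idea (my own sketch in C using the POSIX regex API, not the attached Ruby code; the token codes and the toy grammar are made up):

#include <regex.h>
#include <stdio.h>
#include <ctype.h>

/* The scanner reduces the input to one char per token ('n' = number,
   operators kept as-is); the grammar is then just a regular expression
   matched against that token string. */
int main(void)
{
   const char *input = "12 + 34 * 5";
   char tokens[64];
   int n = 0;
   for (const char *p = input; *p; p++)
   {
      if (isdigit((unsigned char)*p))
      {
         while (isdigit((unsigned char)p[1])) p++;   /* a whole number = one token */
         tokens[n++] = 'n';
      }
      else if (!isspace((unsigned char)*p))
         tokens[n++] = *p;                           /* operator token */
   }
   tokens[n] = '\0';                                 /* here: "n+n*n" */

   regex_t grammar;                                  /* expr := n (('+'|'*') n)* */
   regcomp(&grammar, "^n([+*]n)*$", REG_EXTENDED);
   printf("\"%s\" -> %s\n", tokens,
          regexec(&grammar, tokens, 0, NULL, 0) == 0 ? "valid" : "syntax error");
   regfree(&grammar);
   return 0;
}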

I may not be explaining it well, but I've included as an attachment some (Ruby) code the author was nice enough to share.

Finally, before I can help with the eC parser/compiler, I need to get my skills up to par. Before you got to the meetup, I was telling everyone how, although I did CompSci, I never really got to do any programming for a job, and that was almost 15 years ago! But I still have an interest, and thought what better way than to try to develop some ideas that I've had lying around, or anything really.

I did write a compiler in college, but my Dragon Book days are long gone. For now, I found this "Build Your Own Lisp" (in C), which I'll try to reproduce in eC as a (re)starter project:

http://www.buildyourownlisp.com/

PS: No worries about how long (or not) it takes to reply. I like to use the forum, since whatever is posted could be useful to others later on. I know that for quick replies I should use IRC.
Attachments
littlelexer-1.0.tgz
Regex-generating lexer / regex-consuming parser
(4.06 KiB) Downloaded 1884 times