Curious about what I worked on over the summer? Check out my
new, elegantly-named repo:
https://github.com/konahart/relation-extraction-pipeline
I decided to open source my work because a) my bosses had no problem with it
(i.e., “because I can”) and b) I believe there are parts which may be more
widely useful, such as:
- GigawordCorpusHandler, which parses the SGML structure used by LDC's
Gigaword Corpus (consistent across English, Spanish, and Chinese);
- CoreNLPProcessor, which uses Stanford's CoreNLP to perform tokenization,
NER, and POS annotation (although not in the most efficient way... more on
that later); and
- Utils, which provides a function that checks whether a file is gzipped and
returns the appropriate InputStream (useful for working with large corpora
like Gigaword without having to gzip every file).
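For the curious, here is a minimal sketch of how a gzip check like the one in Utils can work. This is not the code from the repo: the class name GzipAwareReader and the method open are made up for illustration, and the sketch assumes the check is done by peeking at the two gzip magic bytes (0x1f 0x8b) at the start of the file.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration; not the repo's actual Utils class.
public class GzipAwareReader {

    // Opens the file and returns a stream of its uncompressed contents,
    // whether or not the file on disk is gzipped.
    public static InputStream open(Path path) throws IOException {
        BufferedInputStream in = new BufferedInputStream(Files.newInputStream(path));
        try {
            // Peek at the first two bytes, then rewind so nothing is consumed.
            in.mark(2);
            int first = in.read();
            int second = in.read();
            in.reset();

            // Every gzip stream starts with the magic bytes 0x1f 0x8b.
            boolean gzipped = first == 0x1f && second == 0x8b;
            return gzipped ? new GZIPInputStream(in) : in;
        } catch (IOException e) {
            in.close(); // avoid leaking the file handle if detection fails
            throw e;
        }
    }
}
```

A caller can then wrap GzipAwareReader.open(path) in a try-with-resources block and read from it the same way for both compressed and uncompressed files. Checking the magic bytes rather than the file extension means a misnamed file is still handled correctly.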
In the coming months I’ll likely work on breaking them out into their own,
useful little repos, but since the code runs, I might as well make it available
now. Enjoy!