Figure out what's causing the newline issues in the RFE/RL scraper
completed by: Sushain Cherivirala
mentors: Francis Tyers, Jonathan
The scraper we use to build corpora to test transducers has recently acquired some issues with newlines.
Namely, it inserts unwanted characters where \n (and sometimes <br />s) used to be, and it no longer adds newlines after the contents of <p>...</p> and <div>...</div> blocks. It used to handle both cases correctly.
Your task is to track down the cause of this problem and find a workaround that ideally doesn't involve string replacement or looping through elements. The cause could be a new "feature" in lxml, or it could be something introduced by recent modifications to the scraper classes.
If you haven't worked with the scraper before, talk to us about how to test your changes. Ideally, though, whoever takes this task already has experience working with or developing parts of the scraper.
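
One possible way to start tracking the regression down is to save the output for the same article from an older revision of the scraper and from the current one, then diff the two with line endings made visible. The sketch below is only an illustration using the Python standard library. The function name visible_diff and the sample strings are hypothetical and are not part of the scraper.

```python
import difflib


def visible_diff(old_text, new_text):
    """Return unified-diff lines with line endings shown as escapes.

    Hypothetical helper for illustration; not part of the scraper.
    """
    # splitlines(keepends=True) keeps each line's terminator, and repr()
    # renders it as an escape such as \n or \r\n instead of leaving it invisible.
    old_lines = [repr(line) for line in old_text.splitlines(keepends=True)]
    new_lines = [repr(line) for line in new_text.splitlines(keepends=True)]
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="old", tofile="new", lineterm=""
        )
    )


if __name__ == "__main__":
    # Made-up sample strings standing in for saved scraper output.
    old_output = "First paragraph.\nSecond paragraph.\n"
    new_output = "First paragraph.Second paragraph.\n"
    for diff_line in visible_diff(old_output, new_output):
        print(diff_line)
```

Running the same comparison across a few scraper revisions (or lxml versions) should show which change first dropped or altered the line endings, which narrows down where to look for a fix.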