Old Blog Posts

These are blog posts from my old website.

I’m trying a new approach to reading literature. For each paper I read, I add some content here. My hope is that this helps me get more out of my reading by forcing me to identify what mattered most to me in the paper and to link it to other things I have read.

Note that this is not a literature review. I am not aiming to be comprehensive or to take a random sample. The papers I read and write about reflect my interests, biases, and opinions. I also don’t summarise each work completely, but rather focus on the aspects I wanted to remember.

Reading papers

To help me identify the papers I want to read, I have been using the following method:

  1. Go through the proceedings for a conference on the ACL Anthology and read every title. Based on the title, decide whether to read the abstract. Based on the abstract, decide whether to read the introduction. If so, open the paper in a new tab.
  2. Bookmark all tabs. Either use Shift+Command+D or Bookmarks -> Bookmark All Tabs.
  3. Export the folder of bookmarks to a file. To do this, go to chrome://bookmarks, select the new folder then use the menu on the far right of the blue bar to select Export Bookmarks.
  4. Run the code below with bookmarks_DATE.html on standard input (note: requires requests and PyPDF2). This writes example.pdf, which contains (approximately) only the introduction of each paper.
  5. Read through that PDF and flag the papers to read in full.
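# Get the paper URLs from the exported bookmarks (read from stdin)
import sys
papers = {}
for line in sys.stdin:
    if 'aclanthology.org' in line:
        content = line.strip()
        # The second whitespace-separated token is HREF="<url>": take the
        # quoted URL, drop any trailing slash, and point at the PDF
        url = content.split()[1].split('"')[1].rstrip('/') + ".pdf"
        # The link text is "<paper title> - ACL Anthology"
        name = content.split(" - ACL Anthology")[0].split(">")[-1]
        papers[name] = url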

# Download the papers
import io, requests
PDFs = {}
for name, url in papers.items():
    # The ACL Anthology is open access, so no credentials are needed
    r = requests.get(url, timeout=30)
    # Stop on any HTTP error rather than saving an error page as a PDF
    r.raise_for_status()
    PDFs[name] = io.BytesIO(r.content)

# Get the Introductions
# Uses the PyPDF2 2.0+ names; older releases call these PdfFileReader,
# PdfFileWriter, getPage, addPage and extractText
from PyPDF2 import PdfReader, PdfWriter
import string

def starts_section_2(text):
    # Matches a line such as "2Related Work": a 2 directly followed by a letter
    return len(text) > 1 and text[0] == '2' and text[1] in string.ascii_letters

pdf_writer = PdfWriter()
for name, raw_pdf in PDFs.items():
    pdf = PdfReader(raw_pdf)
    page0 = pdf.pages[0]
    pdf_writer.add_page(page0)
    # If section 2 starts on the first page, that page covers the introduction
    done = any(starts_section_2(part)
               for part in page0.extract_text().split('\n'))
    # Otherwise also keep the second page, unless it opens with section 2
    if not done and len(pdf.pages) > 1:
        page1 = pdf.pages[1]
        start = page1.extract_text().split('\n')[0]
        if not starts_section_2(start):
            pdf_writer.add_page(page1)

with open('example.pdf', 'wb') as out:
    pdf_writer.write(out)
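
The section 2 check is the part of the script most likely to need tuning, so it is handy to try it on its own. Here is a minimal sketch that pulls the heuristic into a hypothetical helper, starts_section_two (my name for it, not part of the script above), and runs it on a few made-up lines. It is a slightly more lenient variant: it assumes the extracted heading may appear either with or without a space after the section number.

```python
import string


def starts_section_two(line: str) -> bool:
    """Return True if the line looks like a section 2 heading.

    Hypothetical helper: accepts both "2Related Work" and "2 Related Work",
    on the assumption that text extraction may or may not keep the space
    after the section number.
    """
    if not line.startswith("2"):
        return False
    # Skip any whitespace after the "2", then require a letter
    rest = line[1:].lstrip()
    return bool(rest) and rest[0] in string.ascii_letters


# Made-up lines, for illustration only
samples = ["2Related Work", "2 Background", "2019", "2", "1 Introduction"]
for sample in samples:
    print(repr(sample), starts_section_two(sample))
```

A line such as "2nd" also passes the check, which is one reason the output is only approximately the introduction.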