The Tale of the Fuzzy Text

Manually overriding the EDID data on GNU/Linux with the amdgpu driver

Or: how Camille ended up learning more than she probably needed about EDID, Linux’s amdgpu driver, and even the boot process. This will be relevant to you if you own a recent AMD GPU (one that uses amdgpu instead of the old radeon driver) and have a monitor displaying inexplicable fuzzy and pixelated text.

Preamble

So you just got yourself a fancy new AMD graphics card. Maybe you dual boot Windows and GNU/Linux, the former for gaming and the latter for work. You’ve already booted into Windows and flexed your F250-sized GPU’s muscles on some games, so now it’s time to make sure things work correctly on your Linux install. You boot that sucker up, log in without issue, and are about to declare victory – when you squint a tad. Are your glasses out of focus? Maybe you don’t even wear glasses, but your text is kind of fuzzy and pixelated. Oh no. Something is wrong with the graphics driver. You pray to the Machine Spirits that it’s a simple resolution or refresh rate issue, and start poking at your display settings. Nothing works. Your resolution is correct, your refresh rate is fine, and the card has correctly detected your monitor. If you’ve been using Linux for a while, you sigh deeply, for you know what’s coming: This Is Going To Suck.

To save you some suspense, dear reader, I’ll tell you now what the problem is: your GPU has decided that your monitor is a TV, and is feeding it YCbCr colors. With less color space to work with, and slightly different rendering, the edges of things like text get funky, and everything feels slightly out of focus. Your monitor would rather be fed full RGB, but it will settle for YCbCr. And unfortunately, because your monitor was trusting enough to tell amdgpu that it’s capable of both, the driver decided it would be having YCbCr, whether it wants it or not.

Well then! –

that's easy

– you just have to switch your output to RGB mode, after which you can happily skip along to whatever grail you’re implementing. Alas, It Is Not That Easy.

Initial Thought: Poke X11

Neither the display manager nor the shell gives you options to change the color signal format. X11, however, is aware of it, and makes the decision based on the graphics driver. If you’re on an old card, and are thus on the radeon driver, the property will probably show up in the output from xrandr; but if that’s the case, you won’t be having this problem in the first place, as it only shows up with the newer amdgpu driver. If you look at the X11 log (on newish Ubuntu it will live at ~/.local/share/xorg/Xorg.0.log), you’ll find a line somewhat like:

AMDGPU(0): Supported color encodings: RGB 4:4:4 YCrCb 4:4:4

One would think you could use xrandr to tell X11 to prefer RGB. One would be wrong. The relevant entries don’t exist for amdgpu – no output_csc, no Broadcast RGB. Which makes sense, when you think about it, because of that nice little comment in the linked amdgpu source above: /* TODO: un-hardcode */. Well then.

Next Thought: Spoof the EDID

The EDID Binary

I remembered that I had a similar issue a few years back when using this monitor with my MacBook, and that the solution was to give it a custom EDID file to override what the monitor provides. EDID, or Extended Display Identification Data, is the format your monitor uses to tell your video hardware about its capabilities. EDID is a 128-byte binary format (with support for extension blocks); X11 will report it like so:

	EDID:
		00ffffffffffff0004728700b21e4084
		2c12010380301b78caee95a3544c9926
		0f5054bfef80714f8180b300d1c09500
		a940814081c01a3680a070381e403020
		3500132b21000018000000fd00384c1f
		5311000a202020202020000000fc0048
		323133480a20202020202020000000ff
		004c46383044303032383530300a0148
		02031df14a9005040302011112131423
		09070765030c00100083010000023a80
		1871382d40582c450007442100001f01
		1d8018711c1620582c25000744210000
		9f011d007251d01e206e285500c48e21
		00001e8c0ad08a20e02d10103e960007
		44210000180000000000000000000000
		0000000000000000000000000000000d

The linked Wiki article gives a detailed explanation of how the various fields are encoded; for our purposes, we care about bits 3 and 4 of byte 24, which encode the display type. My monitor happily reports 01 for these bits, which tells the GPU it can handle RGB 4:4:4 and YCrCb 4:4:4, after which amdgpu says “thanks, I hereby dub you a TV, now fuck off.”
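
If you want to sanity-check those bits yourself, a quick Python snippet does the trick (the filename here is a placeholder for an EDID dump, which we’ll grab in a moment):

with open('edid.bin', 'rb') as fp:
    edid = fp.read()

# bits 4:3 of byte 24: 00 = RGB only, 01 = RGB + YCrCb 4:4:4,
# 10 = RGB + YCrCb 4:2:2, 11 = all three
print(format(edid[24], '08b'))
print((edid[24] >> 3) & 0b11)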

On MacOS, I was able to use this handy script, which snarfs your monitor’s EDID, munges the correct bits, and sticks the resulting EDID binary in a place the kernel cares about. On Linux, the commands are different.

Acquiring the EDID Binary

You might get lucky and find an existing modified EDID binary for your specific monitor, perhaps in this thread. Mine is an Acer K272HUL, so I had to do it myself.

You’ll need an EDID file for your monitor to modify. There are a couple of ways to do this. The first is the classic unixy “everything is a file” approach; you can run:

╭─ ‹base› camille@galactica ~ ‹main›
╰─$ find /sys/devices/pci*/ -name edid
/sys/devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/drm/card0/card0-HDMI-A-1/edid
/sys/devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/drm/card0/card0-DP-2/edid
/sys/devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/drm/card0/card0-DP-3/edid
/sys/devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/drm/card0/card0-DP-1/edid

These files give the EDID for the monitor connected to the corresponding port. Your directory structure will be different, depending on how your PCI slots and lanes are laid out. If you cat one, you get a bunch of binary in your terminal. There happens to be a Debian package called read-edid that contains a program that will parse this for you (apt install read-edid); for example, mine gives:

╭─ ‹base› camille@galactica ~ ‹main›
╰─$ cat /sys/devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/drm/card0/card0-DP-2/edid | parse-edid
Checksum Correct

Section "Monitor"
	Identifier "Acer K272HUL"
	ModelName "Acer K272HUL"
	VendorName "ACR"
	# Monitor Manufactured week 4 of 2014
	# EDID version 1.3
	# Digital Display

And so forth (there will be much more output). Note that this is in X11 config format. So let’s copy it…

cat /sys/devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:00.0/0000:03:00.0/drm/card0/card0-DP-2/edid > K272HUL.original.bin

That read-edid package also comes with a utility called get-edid that you can use to extract the EDID file. I suspect it basically does the same thing as above, but try it out if you’d like.

Modifying the EDID Binary

There are a couple of ways to do this. The user-friendly GUI way is to use a program like wxedid; I had to compile it from source, which was a standard ./configure; make deal. There is also a package in the AUR if you’re on Arch. You can set the appropriate bits to 0 down in the CHD section.

You could also do it manually; for example, you might modify that Ruby script from earlier, or use it as a guide to write your own. Here’s some stripped-down Python, for example:

#!/usr/bin/env python

import argparse
from functools import reduce
import sys

def printerr(*args):
    print(*args, file=sys.stderr)

def print_edid_as_hex(edid_bytes):
    edid_hex = ['{:02x}'.format(c) for c in edid_bytes]

    for idx in range(0, len(edid_hex), 16):
        printerr(''.join(edid_hex[idx:idx+16]))


parser = argparse.ArgumentParser()
parser.add_argument('edid_file')
args = parser.parse_args()

edid_string = ''
with open(args.edid_file, 'rb') as fp:
    edid_string = bytearray(fp.read())

printerr('Source EDID (Hex)')
printerr('-' * 64)
printerr()
print_edid_as_hex(edid_string)
printerr()

printerr('Color modes:')
# Bits 4:3 of byte 24 advertise the supported color encodings.
# Note the parentheses: in Python, == binds more tightly than &.
if (edid_string[24] & 0b11000) == 0b00000:
    printerr('RGB 4:4:4 only')
elif (edid_string[24] & 0b11000) == 0b01000:
    printerr('RGB 4:4:4 and YCrCb 4:4:4')
elif (edid_string[24] & 0b11000) == 0b10000:
    printerr('RGB 4:4:4 and YCrCb 4:2:2')
else:
    printerr('RGB 4:4:4, YCrCb 4:4:4, and YCrCb 4:2:2')

printerr('Begin modifying EDID...')
printerr('Setting color mode to RGB 4:4:4 only...')

edid_string[24] &= ~(0b11000)
printerr()

printerr('Removing extension blocks...')
printerr(f'Number of extension blocks: {edid_string[126]}')
printerr('Removing extension blocks...')

edid_string = edid_string[:128]
edid_string[126] = 0
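# byte 127 is a checksum: all 128 bytes of the base block must sum to 0 mod 256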
edid_string[127] = (0x100 - (reduce(lambda a, b: a + b, edid_string[0:127]) % 256)) % 256
printerr()

printerr('Final EDID (Hex)')
printerr('-' * 64)
print_edid_as_hex(edid_string)
printerr()

sys.stdout.buffer.write(edid_string)

Which will give you something like this:

╭─ ‹base› camille@galactica ~ ‹main›
╰─$ python patch-edid-minimal.py .config/K272HUL.original.bin > K272HUL.modified.bin
Source EDID (Hex)
----------------------------------------------------------------

00ffffffffffff000472dd035d5a4040
04180103803c22782a4b75a7564ba325
0a5054bd4b00d100d1c08180950f9500
b30081c0a940565e00a0a0a029503020
350055502100001e000000fd00174c0f
4b1e000a202020202020000000ff0054
30534141303031343230300a000000fc
0041636572204b32373248554c0a0198
020324744f0102030506071011121314
15161f04230907078301000067030c00
2000b83c023a80d072382d40102c9680
565021000018011d8018711c1620582c
250056502100009e011d80d0721c1620
102c258056502100009e011d00bc52d0
1e20b828554056502100001e8c0ad090
204031200c405500565021000018007e

Color modes:
RGB 4:4:4 and YCrCb 4:4:4
Begin modifying EDID...
Setting color mode to RGB 4:4:4 only...

Removing extension blocks...
Number of extension blocks: 1
Removing extension blocks...

Final EDID (Hex)
----------------------------------------------------------------
00ffffffffffff000472dd035d5a4040
04180103803c2278224b75a7564ba325
0a5054bd4b00d100d1c08180950f9500
b30081c0a940565e00a0a0a029503020
350055502100001e000000fd00174c0f
4b1e000a202020202020000000ff0054
30534141303031343230300a000000fc
0041636572204b32373248554c0a00a1

Regardless of your method, you’ve now got an EDID binary matching your monitor without the YCbCr support. Now, we simply tell X11 about it and we’re good to go!

Informing X11

Welp, you can’t.

I tried many. different. variants of X11 configuration options; I tried telling it in a Device block, a Screen block, a Monitor block, to no avail. In fact, it appears once again that the existence of the CustomEDID option is driver-dependent, as it doesn’t even exist in the current version’s manual.

Apparently, we’ll have to go to a lower level.

Informing the Kernel

A number of guides describe the need to load the customized EDID into the kernel as firmware at boot. The all-powerful Arch wiki itself has a section on this, where it provides several solutions. There is even a convenient way to do this after boot, according to the Wiki, using kernel debugging features. As root:

cat K272HUL.modified.bin > /sys/kernel/debug/dri/0/DP-2/edid_override

Well, this did nothing for me. I didn’t spend too much time playing with it, so maybe some reader will have better luck. Instead, I moved on to boot-time loading.

I based my initial attempts off this helpful guide by someone called TingPing, the aforementioned Arch wiki article, and Canonical’s documentation for adding kernel boot parameters. I have multiple monitors, so I needed to load the EDID for a specific port; note the DP-2: before the file path (the firmware directory gets prefixed automatically). So I stuck the modified binary in /usr/lib/firmware/edid/ and put the parameter in /etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="splash drm.edid_firmware=DP-2:edid/K272HUL.modified.bin"

Then you just run update-grub, reboot, and! –

nothing changes! Shit.

A Helpful Bug Report

Naturally, I looked at dmesg to see if the damn kernel would explain itself, and it did! A grep for EDID turned up:

[    1.874241] kernel: [drm:edid_load [drm]] *ERROR* Requesting EDID firmware "edid/K272HUL.modified.bin" failed (err=-2)

Well, it kind of explained itself. Googling for this error turns up a fair number of results, most of which are for corrupted EDID files, but in that case the kernel does more specific complaining – my binary is just fine thankyouverymuch. I finally stumbled upon this bug report, however, which was the last piece of the puzzle. Somewhere along the line, Ubuntu (or the kernel itself?) got more secure, and started requiring the firmware be in the initramfs image. So, you have to add a helper to your initramfs creation: you create a new file at /etc/initramfs-tools/hooks/edid, which contains:

#!/bin/sh
PREREQ=""
prereqs()
{
    echo "$PREREQ"
}

case $1 in
prereqs)
    prereqs
    exit 0
    ;;
esac

. /usr/share/initramfs-tools/hook-functions
# Begin real processing below this line
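# Copy the EDID firmware into the initramfs so drm.edid_firmware can find it at boot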
mkdir -p "${DESTDIR}/lib/firmware/edid"
cp -a /usr/lib/firmware/edid/K272HUL.modified.bin "${DESTDIR}/lib/firmware/edid/K272HUL.modified.bin"
exit 0

Make it executable with chmod +x /etc/initramfs-tools/hooks/edid, and then rebuild the image with update-initramfs -u. If all goes well, you’ll have a new set of images spit out with no error codes. You’ll reboot, furtively peek at your monitor, and!…

success gif

SUCCESS! Your text is nice and crisp, and now if you grep your dmesg, it will say:

[    1.930533] kernel: [drm] Got external EDID base block and 1 extension from "edid/K272HUL.modified.bin" for connector "DP-2"

And the X11 log, when loading up your monitor, should now have:

[   118.655] (II) AMDGPU(0): Supported color encodings: RGB 4:4:4

Sweet, sweet victory.

Automating The Dark Arts

Packaging and Distributing cppyy-generated Python Bindings for C++ Projects with CMake and Setuptools

TL;DR

I rewrote the cppyy CMake modules to be much more user friendly and to work using only Anaconda/PyPI packages, and to generate more feature-complete and customizable Python packages using CMake’s configure_file, while also supporting distribution of cppyy pythonization functions. I then rewrote the existing k-nearest-neighbors example project to use my new system, and wrote bindings generation for bbhash as an example with a real library. Finally, I wrote a recipe for cookiecutter to generate project templates for CMake-built cppyy bindings.

A Bit About Bindings

I’ve been writing Python bindings for C++ code for a number of years now. My first experiences were with raw CPython for our lab’s khmer/oxli project; if you’ve ever done this before, you’re aware that it’s laborious, filled with boilerplate and incantations which are required but easily overlooked. After a while, we were able to convince Titus that the maintenance burden and code bloat would be much lessened by switching to Cython for bindings generation. After a major refactor, most of the old raw CPython was excised.

Cython, however, while powerful, is much better suited to interacting with C than C++. Its support for templates is limited (even C++11 features like variadic templates are only marginally supported, let alone anything newer); it often fails with more than a few overloads; and there’s no built-in mechanism for generating Python classes from templated code. Furthermore, one has to redefine the C/C++ interface in cdef extern blocks in Cython .pxd files, and then deal with many issues converting types and handling encodings. In short, Cython obviates the need for a lot of the boilerplate required with raw CPython, while requiring its own boilerplate and forcing you to keep three distinct layers of code synchronized.

My own research codebase makes considerably more use of C++11-and-beyond template features than the khmer/oxli codebase, but I still wrap it in Python for easier high-level scripting and, in particular, testing. I’ve been working around Cython’s template limitations with a jury-rigged solution: Cython files defined with Jinja2 templates and a hackneyed template type substitution mechanism, all hooked together with my own build system written in pydoit with its own pile of half-assed code for deducing Cython compilation dependencies, a feature which setuptools for Cython currently lacks. This system has worked surprisingly well, regardless of my labmates’ looks of horror when I describe it, but it’s obviously brittle. Luckily, there is a better way…

Automatic Bindings Generation

While discussing this eldritch horror of a build system, my labmate Luiz mentioned a project he’d heard of in passing called cppyy. I began looking into it, and then tested it out on my project. I quickly came to the conclusion that it’s the darkest bit of code art I’ve ever laid eyes on, and I can only assume that Wim Lavrijsen and coauthors spent many a long night around a demon-summoning circle to bring it into existence. Regardless, when it comes to projects that so nicely solve my problems, I can work around Azazel being a core dev.

cppyy is built around the cling interactive C++ interpreter, which itself grew out of the ROOT project at CERN. It uses Clang/LLVM to generate introspection data for C++ code, and then generates and JITs bindings for use in Python. A demo and some more explanation is available on the Python Wiki, though I believe the cppyy docs and their notebook tutorial are more up to date.

cppyy does, to put it mildly, some damn cool shit. A few examples are:

  • Python metaclasses for C++ template classes: the C++ templates become first-class objects in Python, and use the bracket operator for type selection.
  • C++ namespaces as Python submodules.
  • C++17 features. It handles templates so well it can even wrap BOOST.
  • Ability to pass Python functions into C++ by converting them to std::function!
  • Cross-language inheritance. You can even override C++ pure virtual methods, or overload C++ functions, in derived classes, on the Python side.
  • Everything is generated lazily. This adds some startup time the first time classes are requested, but it’s all JIT’d after that.

And remember that this happens automatically. Awesome!
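
To give a flavor of what that looks like in practice, here’s a tiny example of the kind of thing cppyy lets you do (the Accumulator class is made up for illustration): JIT a templated C++ class straight from a Python session and pick a specialization with bracket syntax.

import cppyy

# JIT-compile a small templated class from a string of C++.
cppyy.cppdef('''
template <typename T>
struct Accumulator {
    T total = T();
    void add(T value) { total += value; }
};
''')

# The template is a first-class Python object; select the
# specialization with the bracket operator.
acc = cppyy.gbl.Accumulator['double']()
acc.add(1.5)
acc.add(2.5)
print(acc.total)   # 4.0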

Meta

This post isn’t meant to be a complete tutorial on cppyy. For that, you should look through the documentation and tutorial. Rather, I’m aiming to head off problems that I ran into along the way, and then provide some solutions. So, read on!

Getting it Working

Now, with much love to our colleagues at CERN and elsewhere, my experience interacting with code written by physicists, or even touched by physicists, has been checkered at best. cppyy suffers from some of that familiar lack of tutorials and documentation, but is greatly served by an extremely responsive developer who also happens to seem like a nice person. wlav was very helpful during my experimentation, and I thank them for that.

With that said, I’m going to go through some of the problems I ran into. Ultimately, I decided that if I was going to solve those problems for me, I ought to solve them for everybody, hence this post and the associated software.

To start testing, I immediately began trying to run the associated tools on my own code to see if I could get a few classes working. First, I would need to install cppyy, which turns out to be quite simple. Unfortunately, there is not a recent package on conda-forge, but PyPI is up-to-date, and you can install with pip install cppyy. This will build cppyy’s modified libcling.

Dependencies

The documentation suggested running the code through genreflex, which would require an interface file that #includes my headers and explicitly declares any template specializations I would need. genreflex runs rootcling with a bunch of preconfigured options, which ends up calling into clang and hence needs properly configured include and library paths. It’s likely it will fail at first, due to being unable to find libclang.so or some variant; this can be solved by a sudo apt install clang-7 libclang-7-dev, or whatever the equivalent is for your distribution. You can then go on to run genreflex and then compile with your system compilers as described in the docs.

This is fine for mucking about on your own computer, but ultimately, with modern scientific software, it’s desirable to get this working in a conda environment (and for my purposes, bioconda). This required a lot of trial and error, but ultimately, the necessary minimal invocation for the rest of this post is:

conda create -n cppyy-example python=3 cxx-compiler c-compiler clangdev libcxx libstdcxx-ng libgcc-ng cmake
conda activate cppyy-example 
pip install cppyy clang

The cxx-compiler and c-compiler packages bring in the conda-configured gcc and g++ binaries, and libcxx, libstdcxx-ng, and libgcc-ng bring in the standard libraries. Finally, clangdev brings in libclang.so, which is needed down the line, as is the Python clang package. Then we install cppyy from PyPI with pip, which will build its own cling and whatnot.

At this point, you should be able to generate and build bindings, and then load the resulting dictionaries, as described.

Build Systems

Now that I was able to get things working, I wanted to fit it into a proper build system. I had already decided to convert my project to CMake, and cppyy happens to include its own cmake modules for automating the process. Unfortunately, I rather quickly ran into issues here:

  • As opposed to the documentation, which uses genreflex for introspection generation followed by a call to the compiler to create a shared library, the cppyy_add_bindings function provided by FindCppyy.cmake calls rootcling directly. This means you can’t use a selection XML to select and unselect C++ names to bind, and I was completely unable to get rootcling’s LinkDef mechanism working. This resulted in invalid code being generated, and I was ultimately unable to get it to compile.
  • cppyy includes a script called cppyy-generator which parses your headers and provides a mapping so that a provided initializer routine can inject all the C++ bindings names into the Python module’s namespace; this allows you to call dir() on a namespace and see what’s available before the names are requested and lazily compiled. This script, however, uses the Python clang bindings (pip install clang), which need to find a libclang.so. You can do a conda install clangdev to get the dynamically linked library in a conda environment, but this will still fail, because it also needs the clangdev headers. The conda package tucks these away in $CONDA_PREFIX/lib/clang/<CLANG_VERSION>/include, which is not in any default include path, and so the provided FindLibClang.cmake fails. Minor modifications are not enough: if you add this directory to the provided INCLUDE_DIRS argument to cppyy_add_bindings, it will be passed to the rootcling and g++ invocations as well, which will fail with all sorts of compilation and linking errors because you’ve just mixed up several standard library versions.
  • If you get it all working, the resulting bindings library will fail to find symbols, as described here. This is because CMake needs the LINK_WHAT_YOU_USE directive to instruct ld not to drop the symbols for your C++ shared library.
  • The provided setup.py generators provide no ability to customize for your own target. They dump a string directly to a file with only a few basic package and author options.
  • There is no ability to package pythonization routines. These are sorely needed to make some of the directly-generated C++ interfaces more Pythonic (see the sketch just below for what one looks like). The autogenerated package also lacks things like a MANIFEST.in.
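
For reference, a pythonization is just a callback registered with cppyy that gets a chance to patch each bound class as it is created; a minimal sketch (the namespace and method names here are hypothetical) looks something like:

import cppyy

def pythonize_containers(klass, name):
    # Hypothetical example: any bound class exposing size()/at()
    # gets Python-style __len__ and __getitem__.
    if hasattr(klass, 'size') and hasattr(klass, 'at'):
        klass.__len__ = klass.size
        klass.__getitem__ = klass.at

# Register the callback for classes in the (hypothetical) 'mylib' namespace.
cppyy.py.add_pythonization(pythonize_containers, 'mylib')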

Results

A First Pass: bbhash

So, I went about fixing these things. I took a step away from my rather more complex project, and aimed at wrapping a smaller library (which happens to be one of my dependencies), bbhash, which provides minimal perfect hash functions. This also gave me the chance to learn a lot more about CMake. The results can be found on my github. Essentially, this implementation solves the problems listed above: it uses genreflex and a selection XML; it properly finds and utilizes all the conda compilers and libraries; it allows for packaging pythonizations with a discovery mechanism similar to pytest and other such packages; and it uses templates for the generation of the necessary Python package files. The end result also provides installation targets for both the underlying C++ library and the resulting bindings. Finally, it’s portable enough that I even have it running under continuous integration.

The Second Pass: knn-nearest-neighbors-example

The bbhash example above, while more complex than the existing toy example, does not use a dynamically linked library: rather, the underlying header-only library is stuck into a static library and bundled directly into the bindings’ shared library. I figured I should also apply my work to the existing example, and at the same time fix the previously mentioned issue, so I went ahead and used the same project structure from cppyy-bbhash for a new knn-example. This time, a shared library is created for the underlying C++, and dynamically linked to the bindings library.

The Third Pass: a cookiecutter template

Now that I’d worked out most of the kinks, I figured I ought to make usage a bit easier. So, I created a cookiecutter recipe that will sketch out a basic project structure with my CMake modules and packaging templates. This is a work in progress, but is sufficient to reproduce the previous two repositories.

fin. Well, not really.

I plan to continue work on improving the cookiecutter template and ironing out any more kinks. And of course, I now have to finish applying this work to my own project, as I set out to do in the first place! Finally, I plan on working up a conda recipe demonstrating distribution, so that hopefully, one will soon be able to do a simple conda install mybindings. Look for a future post on that front :)

If you got this far, thanks for your patience and happy hacking!

Function-level fixture parametrization (or, some pytest magic)

In my more recent work with de Bruijn graphs, I’ve been making heavy use of py.test fixtures and parametrization to generate sequence graphs of a particular structure. Fixtures are a wonderful bit of magic for performing test setup and dependency injection, and their cascading nature (fixtures using fixtures!) means a few can be recombined in myriad ways. This post will assume you’ve already bowed to the wonder of fixtures and have some close familiarity with them; if not, it will appear to you as cosmic horror – which maybe it is, but cosmic horror never felt so good.

The Problem

I’ve got a bunch of fixtures, heavily parametrized, which are all composed. For example, I have one for generating varying flavors of our de Bruijn Graph (dBG) objects:

@pytest.fixture(params=[khmer.Nodegraph, khmer.Countgraph],
                ids=['(Type=Nodegraph)', '(Type=Countgraph)'])
def graph(request, ksize):
    ''' Instantiate a dBG with the parametrized
    underlying dBG type.
    ''' 
    num_kmers = 50000
    target_fp = 0.00001
    args = optimal_fp(num_kmers, target_fp)

    return request.param(ksize, args.htable_size,
                         args.num_htables) 

For those not familiar with the dBG, for our purposes it is a graph where the nodes are sequences of length \(K\) for some alphabet \(\Sigma\) (in our case, \(\Sigma = \{A, C, G, T\}\)). We draw an edge \(e_i = v_j \rightarrow v_k\) if the length \(K-1\) suffix of \(v_j\) matches the length \(K-1\) prefix of \(v_k\). This turns out to be highly useful when we want to take a pile of highly redundant short random samples of an underlying sequence and try to extract something close to the underlying sequence. A more in-depth discussion of dBGs is, uh, left as an exercise to the reader – what we really care about here is \(K\). It seems to be showing up often: as an argument to our dBG objects, as a way to prevent loops and overlaps in our randomly generated sequences, and, as it turns out, all over our tests for various indexing operations.
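
As a concrete illustration of that edge rule (a toy check, not from the actual codebase):

def has_edge(v_j, v_k, K):
    # v_j -> v_k iff the (K-1)-suffix of v_j equals the (K-1)-prefix of v_k
    return v_j[-(K - 1):] == v_k[:K - 1]

assert has_edge('ACGT', 'CGTA', K=4)      # overlap: 'CGT'
assert not has_edge('ACGT', 'GGGA', K=4)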

And so, there’s also fixture for generating random nucleotide sequences that don’t overlap in a dBG of order \(K\):

@pytest.fixture(params=list(range(500, 1600, 500)),
                ids=lambda val: '(L={0})'.format(val))
def random_sequence(request, ksize):
    global_seen = set()
    def get(exclude=None):
        # note that the definition of get random sequence
        # is also left as an exercise to the reader
        sequence = get_random_sequence(request.param,
                                       ksize,
                                       exclude=exclude,
                                       seen=global_seen)         
        for i in range(len(sequence)-ksize):      
            global_seen.add(sequence[i:i+ksize-1])
            global_seen.add(revcomp(sequence[i:i+ksize-1]))
        return sequence

    return get

…and naturally, one to compose them:

@pytest.fixture
def linear_structure(request, graph, ksize, random_sequence):
    '''Sets up a simple linear path graph structure.

    sequence
    [0]→o→o~~o→o→[-K]
    '''
    sequence = random_sequence()
    graph.consume(sequence)

    # Check for false positive neighbors in our graph:
    # this linear path should have no "high degree nodes"
    # Mark as an expected failure if any are found
    if hdn_counts(sequence, graph):
        request.applymarker(pytest.mark.xfail)

    return graph, sequence

You might notice a few things about these fixtures:

  1. OMG you’re testing with random data. Yes I am! But, the space for that data is highly constrained, it seems to be doing the job well, and I can always introspect unexpected failures.
  2. The random_sequence fixture returns a function! This is a trick to share some state at function scope: we keep track of the global set of seen k-mers, and the resulting function can generate many sequences.
  3. There’s an undefined parameter or fixture: ksize.

The last bit is the interesting part.

So, of course, it seems we should just write a fixture for \(K\)! The simplest approach might be:

@pytest.fixture
def ksize():
    return 21

This sets one default \(K\) for each fixture and test using it. This kinda sucks though: we should be testing different \(K\) sizes! So…

@pytest.fixture(params=[21, 25, 29])
def ksize(request):
    return request.param

Slightly better! We test three values for \(K\) instead of one. Unfortunately, it still doesn’t quite cut it: for some tests we want a more trivial dBG (say with \(K=4\)), or we might not want or need to have three instances of every single test. We need individual tests to be able to set their own \(K\), and importantly, it still needs to trickle down to all the fixtures the test depends on. I’d also like this to be somewhat clear and concise: it turns out that what I’m about to show you can more or less be achieved with indirect parametrization, but I find that interface clunky (and not very well documented), and besides, this taught me a bit more about pytest.
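
For comparison, the indirect route looks roughly like this (a sketch, not the actual fixtures from above):

import pytest

@pytest.fixture
def ksize(request):
    # with indirect=True, the parametrized value arrives via request.param
    return getattr(request, 'param', 21)

@pytest.mark.parametrize('ksize', [21, 31], indirect=True,
                         ids=lambda k: 'K={0}'.format(k))
def test_with_indirect_ksize(ksize):
    assert ksize in (21, 31)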

The Solution

My first thought was that it’d be nice to just set a variable within a test function and reach through the request object to pull it out with getattr, which would produce tests something like:

def test_something(graph, linear_structure):
    ksize = 5
    # stuff
    assert condition == True

Turns out this doesn’t work properly with test collection and Python’s scoping rules, and just feels icky to boot. We need a way to pass some information to the fixture, while also making it clear that it’s a property of the test itself and not some detail of the test’s implementation. Then I realized: decorators!

So, I came up with this:

def using_ksize(K):
    '''A decorator to set the _ksize 
    attribute for individual tests.
    '''
    def wrap(func):                                      
        setattr(func, '_ksize', K)
        return func                                      
    return wrap

Pretty straightforward: all it does is add an attribute called _ksize to the test function. However, we need to tell pytest and our fixtures about it. Turns out that the pytest API already has a hook for more granular control over parametrization, called pytest_generate_tests. This lets us grab the fixtures being used by whatever function pytest is currently setting up and poke at their generation in various ways. For example, in my case…

# goes in conftest.py

def pytest_generate_tests(metafunc):
    if 'ksize' in metafunc.fixturenames:
        ksize = getattr(metafunc.function, '_ksize', None)
        if ksize is None:
            ksize = [21]
        if isinstance(ksize, int):
            ksize = [ksize]
        metafunc.parametrize('ksize', ksize,  
                             ids=lambda k: 'K={0}'.format(k))

So what is this nonsense? We look at the metafunc, which contains the requesting context, and into its list of fixture names. If we find one called ksize, we check the calling function in metafunc.function for the _ksize attribute; if we don’t find it, we set a default value, and if we do, we just use it.

Now, we can write a couple different sorts of tests:

def test_something_default_ksize(ksize, random_sequence):
    # do stuff...
    assert ksize == 21

@using_ksize(25)
def test_something_override_ksize(ksize, random_sequence):
    # random sequence also gets the
    # override in its ksize fixture
    assert ksize == 25

@using_ksize([21,31])
def test_something_ksize_parametrize(ksize, random_sequence):
    # this one gets properly parametrized for 21 and 31
    assert condition

I rather like this approach: it’s quite clear and retains all the pytest fixture goodness, while also giving more granular control. This is a simple parametrization case which admittedly could be accomplished with indirect parametrization, but one could imagine scenarios where the indirect method would be insufficient. Curiously, you don’t even have to write an actual fixture function with this approach, as it’s implied by the function argument lists.

In my case, I eventually get tests like…

tests/test_compact_dbg.py::test_compact_triple_fork[(L=1500)-Start-(Type=Nodegraph)-K=21] PASSED
tests/test_compact_dbg.py::test_compact_triple_fork[(L=1500)-Start-(Type=Nodegraph)-K=31] PASSED
tests/test_compact_dbg.py::test_compact_triple_fork[(L=1500)-Start-(Type=Countgraph)-K=21] PASSED
tests/test_compact_dbg.py::test_compact_triple_fork[(L=1500)-Start-(Type=Countgraph)-K=31] PASSED
tests/test_compact_dbg.py::test_compact_triple_fork[(L=1500)-End-(Type=Nodegraph)-K=21] PASSED
tests/test_compact_dbg.py::test_compact_triple_fork[(L=1500)-End-(Type=Nodegraph)-K=31] PASSED
tests/test_compact_dbg.py::test_compact_triple_fork[(L=1500)-End-(Type=Countgraph)-K=21] PASSED
tests/test_compact_dbg.py::test_compact_triple_fork[(L=1500)-End-(Type=Countgraph)-K=31] PASSED

In the end, only mild cosmic horror.

SciPy 2016 Retrospective

Last week I attended my first SciPy conference in Austin. I’ve been to the past three PyCons in Montreal and Portland, and aside from my excitement to learn more about the great scientific Python community, I was curious to see how it compared to the general conference I’ve come to know and love.

SciPy, by my account, is a curious microcosm of the academic open source community as a whole. It is filled with great people doing amazing work, releasing incredible tools, and pushing the frontiers of features and accessibility in scientific software. It is also marked by some of the same problems as the larger community: a stark lack of gender (and other) diversity and a surprising (or not) lack of consciousness of the problem. I’ll start by going over some of the cool projects I learned about and then move on to some thoughts on the gender issue.

Cool Stuff

nbflow

Several new projects were announced, and several existing projects were given some needed visibility. The first I’ll talk about is nbflow. This is Jessica Hamrick’s system for “one-button reproducible workflows with the Jupyter notebook and scons.” In short, you can link up notebooks in a build system via two special variables in the first cells of a collection of notebooks – __depends__ and __dest__ – which contain lists of source and target filenames and are parsed out of the JSON to automagically generate build tasks. Jessica’s implementation is clean and can be pretty easily grokked with only a few minutes of reading the code, and it’s intuitive and relatively well tested. She delivered a great presentation with excellent slides and nice demos (which all worked ;)).
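
In other words, the first cell of each notebook in the pipeline looks something like this (filenames invented for illustration):

# first cell of analysis.ipynb; nbflow pulls these out of the notebook JSON
__depends__ = ['results/cleaned_data.csv']
__dest__ = ['results/figure_1.pdf']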

The only downside is that it uses scons, which isn’t Python 3 compatible and isn’t what I use, which must mean it’s bad or something. However, this turned out to be a non-issue due to the earlier point about the clean codebase: I was able to quickly build a pydoit module with her extractor, and she’s been responsive to the PR (thanks!). It would be pretty easy to build modules for any number of build systems – it only requires about 50 lines of code. I’m definitely looking forward to using nbflow in future projects.

JupyterLab

The Jupyter folks made a big splash with JupyterLab, which is currently in alpha. They’ve built an awesome extension API that makes adding new functionality dead simple, and it appears they’ve removed many of the warts from the current Jupyter client. State is seamlessly and quickly shipped around the system, making all the components fully live and interactive. They’re calling it an IDE – an Interactive Development Environment – and it will likely improve greatly upon the current Python data exploration workflow. It’s reminiscent of Rstudio in a lot of ways, which I think is a Good Thing; intuitive and simple interfaces are important to getting new users up and running with the language, and particularly helpful in the classroom. They’re shooting to have a 1.0 release out by next year’s SciPy, emphasizing that they’ll require a 1.0 to be squeaky clean. I’ll be anxiously awaiting its arrival!

Binder

Binder might be oldish news to many people at this point, but it was great to see it represented. For those not in the know, it allows you to spin up Jupyter notebooks on-demand from a github repo, specifying dependencies with Docker, PyPI, and Conda. This is a great boon for reproducibility, executable papers, classrooms, and the like.

Altair

The first keynote of the conference was yet another plotting library, Altair. I must admit that I was somewhat skeptical going in. The lament and motivation behind Altair was how users have too many plotting libraries to choose from and too much complexity, and solving this problem by introducing a new library invokes the obligatory xkcd. However, in the end, I think the move here is needed.

Altair is a Python interface to vega-lite; the API is a straightforward plotting interface which spits out a vega-lite spec to be rendered by whatever vega-compatible graphics frontend the user might like. This is a massive improvement over the traditional way of using vega-lite, which is “simply write raw JSON(!)” It looks to have sane defaults and produce nice looking plots with the default frontend. More important, however, is the paradigm shift they are trying to initiate: that plotting should be driven by a declarative grammar, with the implementation details left up to the individual libraries. This shifts much of the programming burden off the users (and on to the library developers), and would be a major step toward improving the state of Python data visualization.
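
From memory, the interface looks roughly like this (a sketch, not an example from the talk):

import altair as alt
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 1, 6]})

chart = alt.Chart(df).mark_point().encode(x='x', y='y')
print(chart.to_json())   # the underlying vega-lite spec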

Imperative (hah!) to this shift is the library developers all agreeing to use the same grammar. Several of the major libraries (bokeh and plot.ly?) already use bespoke internal grammars and, according to the talk, are looking to adopt vega. Altair has taken the aggressive approach: the tactic seems to be to firmly plant the graphics grammar flag and force the existing tools to adopt before they have a chance to pollute the waters with competing standards. Somebody needed to do it, and I think it’s better that vega does.

There are certainly deficiencies though. vega-lite is relatively spartan at this point – as one questioner in the audience highlighted, it can’t even put error bars on plots. This sort of obvious feature vacuum will need to be rapidly addressed if the authors expect the spec to be adopted wholeheartedly by the scientific python community. Given the chops behind it, I fully expect these issues to be addressed.

Gender Stuff

I’ve focused on the cool stuff at the conference so far, but not everything was so rosy. Let’s talk about diversity – of the gender sort, but the complaint applies to race, ability, and so forth.

There’s no way to state this other than frankly: it was abysmal. I immediately noticed the sea of male faces, and a friend of mine had at least one conversation with a fellow conference attendee while he had a conversation with her boobs. The Code of Conduct was not clearly stated at the beginning of the conference, which makes a CoC almost entirely useless: it shows potential violators that the organizers don’t really prioritize the CoC and probably won’t enforce it, and it signals the same to the minority groups that the conference ostensibly wants to engage with. As an example, while Chris Calloway gave a great lightning talk about how PyData North Carolina is working through the aftermath of HB2, several older men directly behind me giggled amongst themselves at the mention of gender neutral bathrooms. They probably didn’t consider that there was a trans person sitting right in front of them, and they certainly didn’t consider the CoC, given that it was hardly mentioned. This sort of shit gives all the wrong signals for folks like myself. At PyCon the previous two years, I felt comfortable enough to create a #QueerTransPycon BoF, which was well attended; although the more focused nature of SciPy makes such an event less appropriate, I would not have felt comfortable trying regardless.

The stats are equally bad: 12 out of 124 speakers, 8 out of 52 poster presenters, and 4 out of 37 tutorial presenters were women, and the stats are much worse for people of color. The lack of consciousness of the problem was highlighted by some presenters noting the great diversity of the conference (maybe they were talking about topics?), and in one case, by the words of an otherwise well-meaning man whom I had a conversation with; when the 9% speaker rate for women was pointed out to him, he pondered the number and said that it “sounded pretty good.” It isn’t! He further pressed as to whether we would be satisfied once it hit 50%; somehow the “when is enough enough?” question always comes up. What’s clear is that “enough” is a lot more than 9%. This state of things isn’t new – several folks have written about it in regards to previous years.

There are some steps that can be taken here – organizers could look toward the PSF’s successful efforts to improve the gender situation at PyCon, where funding was sought for a paid chair (as opposed to SciPy’s unpaid position). The Code of Conduct should be clearly highlighted and emphasized at the beginning of the conference. For my part, I plan to submit a tutorial and a talk for next year.

I don’t want to only focus on the bad; the diversity luncheon was well attended, there was a diversity panel, and a group has been actively discussing the issues in a dedicated channel on the conference Slack team. These things signal that there is some will to address this. I also don’t want to give any indication that things are okay – they aren’t, and there’s a ton of work to be done.

Closing

I’m grateful to my adviser Titus for paying for the trip, and generally supporting my attending events like this and rabble rousing. I’m also grateful to the conference organizers for putting together an all-in-all good conference, and to all the funders present who make all this scientific Python software that much more viable and robust. For anyone reading this and thinking, “I’m doing thing X to combat the gender problem, why don’t you help out?” feel free to contact me on twitter.

some notes on py.test, travis-ci, and SciPy 2016

I’ve been in Austin since Tuesday for SciPy 2016, and after a couple weeks in Brazil and some time off the grid in the Sierras, I can now say that I’ve been officially bludgeoned back into my science and my Python. Aside from attending talks and meeting new people, I’ve been working on getting a little package of mine up to scratch with tests and continuous integration, with the eventual goal of submitting it to the Journal of Open Source Software. I had never used travis-ci before, nor had I used py.test in an actual project, and as expected, there were some hiccups – learn from mine to avoid your own :)

Note: this blog post is not beginner friendly. For a simple intro to continuous integration, check out our pycon tutorial, travis ci’s intro docs, or do further googling. Otherwise, to quote Worf: ramming speed!

travis

Having used drone.io in the past, I had a good idea of where to start here. travis is much more feature rich than drone though, and as such, requires a bit more configuration. My package, shmlast, is not large, but it has some external dependencies which need to be installed and relies on the numpy-scipy-pandas stack. drone’s limited configuration options and short maximum run time quickly make it intractable for projects with non-trivial dependencies, and this was where travis stepped in.

getting your scientific python packages

The first stumbling block here was deciding on a python distribution. Using virtualenv and PyPI is burdensome with numpy, scipy, and pandas – they almost always want to compile, which takes much too long. Being an impatient page-refreshing fiend, I simply could not abide the wait. The alternative is to use anaconda, which does us the favor of compiling them ahead of time (while also being a little smarter about managing dependencies). The default distribution is quite large though, so instead, I suggest using the stripped-down miniconda and installing the packages you need explicitly. Detailed instructions are available here, and I’ll run through my setup.

The miniconda setup goes under the install directive in your .travis.yml:

install:
  - sudo apt-get update
  - if [[ "$TRAVIS_PYTHON_VERSION" == "2.7" ]]; then
      wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh -O miniconda.sh;
    else
      wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;
    fi
  - bash miniconda.sh -b -p $HOME/miniconda
  - export PATH="$HOME/miniconda/bin:$PATH"
  - hash -r
  - conda config --set always_yes yes --set changeps1 no
  - conda update -q conda
  - conda info -a
  - conda create -q -n test python=$TRAVIS_PYTHON_VERSION numpy scipy pandas=0.17.0 matplotlib pytest pytest-cov coverage sphinx nose
  - source activate test
  - pip install -U codecov
  - python setup.py install

Woah! Let’s break it down. Firstly, there’s a check of travis’s python environment variable to grab the correct miniconda distribution. Then we install it, add it to PATH, and configure it to work without interaction. The conda info -a is just a convenience for debugging. Finally, we go ahead and create the environment. I do specify a version for Pandas; if I were more organized, I might write out a conda environment.yml and use that instead. After creating the environment and installing a non-conda dependency with pip, I install the package. This gets us ready for testing.

After a lot of fiddling around, I believe this is the fastest way to get your Python environment up and running with numpy, scipy, and pandas. You can probably safely use virtualenv and pip if you don’t need to compile massive libraries. The downside is that this essentially locks your users into the conda ecosystem, unless they’re willing to risk going it alone re: platform testing.

non-python stuff

Bioinformatics software (or more accurately, its users…) often has to grind through the Nine Circles (or perhaps orders of magnitude) of Dependency Hell to get installed, and if you want CI for your project, you’ll have to automate this devilish journey. Luckily, travis has extensive support for this. For example, I was easily able to install the LAST aligner from source by adding some commands under before_script:

before_script:
  - curl -LO http://last.cbrc.jp/last-658.zip
  - unzip last-658.zip
  - pushd last-658 && make && sudo make install && popd

The source is first downloaded and unpacked. We need to avoid mucking up our current location when compiling, so we use pushd to save our directory and move to the folder, then make and install before using popd to jump back out.

Software from Ubuntu repos is even simpler. We can add these commands to before_install:

before_install:
  - sudo apt-get -qq update
  - sudo apt-get install -y emboss parallel

This grabbed emboss (which includes transeq, for 6-frame DNA translation) and gnu-parallel. These commands could probably just as easily go in the install section, but the travis docs recommended they go here and I didn’t feel like arguing.

py.test

and the import file mismatch

I’ve used nose in my past projects, but I’m told the cool kids (and the less-cool kids who just don’t like deprecated software) are using py.test these days. Getting some basic tests up and running was easy enough, as the patterns are similar to nose, but getting everything integrated was more difficult. Pretty soon, after running a python setup.py test or even a simple py.test, I was running into a nice collection of these errors:

import file mismatch:
imported module 'shmlast.tests.test_script' has this __file__ attribute:
  /work/shmlast/shmlast/tests/test_script.py
which is not the same as the test file we want to collect:
  /work/shmlast/build/lib/shmlast/tests/test_script.py
HINT: remove __pycache__ / .pyc files and/or use a unique basename for your test file modules

All the google results for this error led to threads with devs and other benevolent folks patiently explaining that you need to have unique basenames for your test modules (I mean it’s right there in the error duh), or that I needed to delete __pycache__. My basenames were unique and my caches clean, so something else was afoot. An astute reader might have noticed that one of the paths given is under the build/ directory, while the other is in the root of the repo. Sure enough, deleting the build/ directory fixes the problem. This seemed terribly inelegant though, and quite silly for such a common use-case.

Well, it turns out that this problem is indirectly addressed in the docs. Unfortunately, it’s 1) under the obligatory “good practices” section, and who goes there? and 2) doesn’t warn that this error can result (instead there’s a somewhat confusing warning telling you not to use an __init__.py in your tests subdirectory, but also that you need to use one if you want to inline your tests and distribute them with your package). The problem is that py.test happily slurps up the tests in the build directory as well as the repo, which triggers the expected unique basename error. The solution is to be a bit more explicit about where to find tests.

Instead of running a plain old py.test, you run py.test --pyargs <pkg>, which in clear and totally obvious language in the help is said to make py.test “try to interpret all arguments as python packages.” Clarity aside, it fixes it! To be extra double clear, you can also add a pytest.ini to your root directory with a line telling where the tests are:

[pytest]
testpaths = path/to/tests

organizing test data

Other than documentation gripes, py.test is a solid library. Particularly nifty are fixtures, which make it easy to abstract away more boilerplate. For example, in the past I’ve used the structure of our lab’s khmer project for grabbing test data and copying it into temp directories, but it involves a fair amount of code and bookkeeping. With a fixture, I can easily access test data in any test, while cleaning up the garbage:

import os
from distutils import dir_util

from pytest import fixture


@fixture
def datadir(tmpdir, request):
    '''
    Fixture responsible for locating the test data directory and copying it
    into a temporary directory.
    '''
    filename = request.module.__file__
    test_dir = os.path.dirname(filename)
    data_dir = os.path.join(test_dir, 'data') 
    dir_util.copy_tree(data_dir, str(tmpdir))

    def getter(filename, as_str=True):
        filepath = tmpdir.join(filename)
        if as_str:
            return str(filepath)
        return filepath

    return getter

Deep in my heart of hearts I must be a functional programmer, because I’m really pleased with this. Here, we get the path to the tests directory, and then the data directory which it contains. The test data is then all copied to a temp directory, and by the awesome raw power of closures, we return a function which will join the temp dir with a requested filename. A better version would handle a nonexistent file, but I said raw power, not refined and domesticated power. Best of all, this fixture uses another fixture, the builtin tmpdir, which makes sure the files get blown away when you’re done with them.

Use it as a fixture in a test in the canonical way:

def test_test(datadir):
    test_filename = datadir('test.csv')
    assert True # professional coder on a closed course

Next up: thoughts on SciPy 2016!