{"found":50584,"hits":[{"document":{"authors":[{"affiliation":[{"id":"https://ror.org/050qmg959","name":"Singapore Management University"}],"contributor_roles":[],"family":"Tay","given":"Aaron","url":"https://orcid.org/0000-0003-0159-013X"}],"blog":{"authors":null,"community_id":"f34e2211-9904-4b58-97ab-0beeb79ef6f7","created":1697068800,"current_feed_url":null,"description":"Aaron Tay's thoughts about academic librarianship","favicon":"https://rogue-scholar.org/api/communities/f34e2211-9904-4b58-97ab-0beeb79ef6f7/logo","feed_format":"application/rss+xml","feed_url":"https://aarontay.substack.com/feed","filter":null,"generator":"Substack","home_page_url":"https://aarontay.substack.com","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"musings","status":"active","subfield":"3309","title":"Aaron Tay's Musings about Librarianship","updated":1781540136,"use_api":true},"blog_name":"Aaron Tay's Musings about Librarianship","blog_slug":"musings","content_html":"<p><em>This post is part of a \"hot takes\" series in which I make sharper claims than I usually do. I do not intend to offend, and I am not trying to tar every librarian with the same brush \u2014 the patterns I describe and perceive may be a function of my own local context. 
</em></p><p><em>In my last hot takes post, <a href=\"https://aarontay.substack.com/p/hot-take-stop-calling-poor-search\">I argued that while designing a search system to maximise learning gains may not always align with designing a search system that scores the best for relevancy, unplanned friction in learning aka poor relevancy is never a good idea.</a></em></p><p><em>In this post, I consider the idea of tools that are considered flawed because they can't match the performance of an older tool in some way<a class=\"footnote-anchor\" data-component-name=\"FootnoteAnchorToDOM\" id=\"footnote-anchor-1\" href=\"#footnote-1\" target=\"_self\">1</a>.</em></p><p class=\"button-wrapper\" data-attrs=\"{&quot;url&quot;:&quot;https://ko-fi.com/aarontay&quot;,&quot;text&quot;:&quot;Buy me coffee via ko-fi!&quot;,&quot;action&quot;:null,&quot;class&quot;:null}\" data-component-name=\"ButtonCreateButton\"><a class=\"button primary\" href=\"https://ko-fi.com/aarontay\"><span>Buy me coffee via ko-fi!</span></a></p><div class=\"captioned-image-container\"><figure><a class=\"image-link image2 is-viewable-img\" target=\"_blank\" href=\"https://substackcdn.com/image/fetch/$s_!ciEX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png\" data-component-name=\"Image2ToDOM\"><div class=\"image2-inset\"><picture><source type=\"image/webp\" srcset=\"https://substackcdn.com/image/fetch/$s_!ciEX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png 424w, https://substackcdn.com/image/fetch/$s_!ciEX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png 848w, 
https://substackcdn.com/image/fetch/$s_!ciEX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png 1272w, https://substackcdn.com/image/fetch/$s_!ciEX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png 1456w\" sizes=\"100vw\"><img src=\"https://substackcdn.com/image/fetch/$s_!ciEX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png\" width=\"1134\" height=\"651\" data-attrs=\"{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:651,&quot;width&quot;:1134,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}\" class=\"sizing-normal\" alt=\"\" srcset=\"https://substackcdn.com/image/fetch/$s_!ciEX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png 424w, https://substackcdn.com/image/fetch/$s_!ciEX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png 848w, 
https://substackcdn.com/image/fetch/$s_!ciEX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png 1272w, https://substackcdn.com/image/fetch/$s_!ciEX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png 1456w\" sizes=\"100vw\" fetchpriority=\"high\"></picture></div></a></figure></div><p>Imagine you are a librarian in 2004, <a 
href=\"https://googleblog.blogspot.com/2004/10/scholarly-pursuits.html\">when Google Scholar launches in beta</a>.<br>You have read the studies. <a href=\"https://www.emerald.com/oir/article/29/2/208/315378/Google-Scholar-the-pros-and-the-cons\">The coverage gaps are real</a>. <a href=\"https://vlex.co.uk/vid/metadata-mega-mess-in-846721887\">The metadata is wretched</a>. Nobody, including Google, can tell you exactly what is indexed<a class=\"footnote-anchor\" data-component-name=\"FootnoteAnchorToDOM\" id=\"footnote-anchor-2\" href=\"#footnote-2\" target=\"_self\">2</a>. </p><p>You decide it is the dumbest thing on the planet and declare that it could never be useful to anyone, despite a vocal minority of users insisting otherwise.</p><p>Fast forward to 2024. Google Scholar has become the default academic search starting point for many researchers and is widely regarded as the most comprehensive free academic search engine<a class=\"footnote-anchor\" data-component-name=\"FootnoteAnchorToDOM\" id=\"footnote-anchor-3\" href=\"#footnote-3\" target=\"_self\">3</a>. Your first assumption might be that Google fixed all the problems.</p><p>But that assumption would be mostly wrong.</p><p>Sure, Google Scholar improved, especially in coverage. Its metadata has arguably improved too. But you still cannot consult a stable, auditable list of indexed sources or records. Even Google's own guidance effectively tells users to test coverage by sampling titles rather than by checking a definitive inventory. One of the fundamental weaknesses librarians diagnosed in 2004 is still there.</p><p>And yet the librarians and researchers of 2024 are not idiots.</p><p>What happened is simple. Users learnt to compensate. They used Scholar for what it was good at and routed around its flaws, or used another tool for other use cases (e.g. evidence synthesis). 
They learnt when its metadata could not be trusted, when its coverage was opaque, and when they needed to complement it with another database. The tool did not become perfect. It became useful enough, alongside other tools.</p><div class=\"captioned-image-container\"><figure><a class=\"image-link image2 is-viewable-img\" target=\"_blank\" href=\"https://substackcdn.com/image/fetch/$s_!_kqs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png\" data-component-name=\"Image2ToDOM\"><div class=\"image2-inset\"><picture><source type=\"image/webp\" srcset=\"https://substackcdn.com/image/fetch/$s_!_kqs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png 424w, https://substackcdn.com/image/fetch/$s_!_kqs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png 848w, https://substackcdn.com/image/fetch/$s_!_kqs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png 1272w, https://substackcdn.com/image/fetch/$s_!_kqs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png 1456w\" sizes=\"100vw\"><img src=\"https://substackcdn.com/image/fetch/$s_!_kqs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png\" width=\"999\" height=\"632\" 
data-attrs=\"{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:632,&quot;width&quot;:999,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:814374,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aarontay.substack.com/i/200014373?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}\" class=\"sizing-normal\" alt=\"\" srcset=\"https://substackcdn.com/image/fetch/$s_!_kqs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png 424w, https://substackcdn.com/image/fetch/$s_!_kqs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png 848w, https://substackcdn.com/image/fetch/$s_!_kqs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png 1272w, https://substackcdn.com/image/fetch/$s_!_kqs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png 1456w\" sizes=\"100vw\" loading=\"lazy\"></picture></div></a></figure></div><p>That trajectory is surprisingly common with new technology, and it is one I think about when I read arguments that say AI, or any technology tool, can never be useful because of some \"fundamental flaw\".</p><h2>The concession has already been made on usefulness</h2><div class=\"captioned-image-container\"><figure><a class=\"image-link image2 is-viewable-img\" target=\"_blank\" href=\"https://substackcdn.com/image/fetch/$s_!z3Ha!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png\" data-component-name=\"Image2ToDOM\"><div class=\"image2-inset\"><picture><source type=\"image/webp\" 
srcset=\"https://substackcdn.com/image/fetch/$s_!z3Ha!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png 424w, https://substackcdn.com/image/fetch/$s_!z3Ha!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png 848w, https://substackcdn.com/image/fetch/$s_!z3Ha!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png 1272w, https://substackcdn.com/image/fetch/$s_!z3Ha!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png 1456w\" sizes=\"100vw\"><img src=\"https://substackcdn.com/image/fetch/$s_!z3Ha!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png\" width=\"1456\" height=\"740\" data-attrs=\"{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:740,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1543664,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aarontay.substack.com/i/200014373?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}\" 
class=\"sizing-normal\" alt=\"\" srcset=\"https://substackcdn.com/image/fetch/$s_!z3Ha!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png 424w, https://substackcdn.com/image/fetch/$s_!z3Ha!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png 848w, https://substackcdn.com/image/fetch/$s_!z3Ha!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png 1272w, https://substackcdn.com/image/fetch/$s_!z3Ha!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png 1456w\" sizes=\"100vw\" loading=\"lazy\"></picture></div></a></figure></div><p>Some of the most prominent sceptics have already said as much.</p><p>Gary Marcus, cognitive scientist, author of Rebooting AI, and one of the most consistent public critics of AI hype, has repeatedly acknowledged that LLMs can be useful, especially for coding, brainstorming and writing, while arguing that they are unreliable and not a route to AGI alone.</p><p>Margaret Mitchell, a co-author of the influential \"<a href=\"https://dl.acm.org/doi/10.1145/3442188.3445922\">Stochastic Parrots</a>\" paper, has been even more explicit: <a href=\"https://medium.com/@margarmitchell/no-ai-is-not-a-stochastic-parrot-a99e57766bed\">LLMs can be \"extremely useful\"</a>.</p><p><a href=\"https://medium.com/@emilymenonbender/stochastic-parrots-frequently-unasked-questions-49c2e7d22d11\">Emily M. Bender has likewise clarified that \"stochastic parrot\" is a description or metaphor, not an argument that these systems have no possible utility</a><a class=\"footnote-anchor\" data-component-name=\"FootnoteAnchorToDOM\" id=\"footnote-anchor-4\" href=\"#footnote-4\" target=\"_self\">4</a><a href=\"https://medium.com/@emilymenonbender/stochastic-parrots-frequently-unasked-questions-49c2e7d22d11\">.</a></p><p>Mike Caulfield, of SIFT fame, is actively studying and using these tools. These are not AI boosters. If even they concede that the tools can be useful, the basic question has already been settled.</p><p>You can oppose LLM use on environmental grounds, labour grounds, epistemic grounds, or any number of other defensible grounds. 
</p><p>But the claim that these tools can never be useful to anyone has moved past argument into something closer to an unfalsifiable position. No demonstration of utility, no improvement in the tools, and no evidence of successful use by researchers seems able to update it. </p><h2>A note on scope</h2><p>For the rest of this post, when I say \"AI tools\", I mean AI-powered academic search tools. I work in this space, so I will stay in my lane. The argument may extend elsewhere, but I am not pretending to make that case here.</p><p>It is also worth being precise about what \"AI search\" means. <a href=\"https://aarontay.substack.com/p/what-do-we-actually-mean-by-ai-powered\">The term covers several different things: changing what gets retrieved, reranking results, summarising content, and generating direct answers to questions</a>. These are not the same capability and do not carry the same risks.</p><p>A librarian who objects to generative answer synthesis is making a different argument from one who objects to AI-assisted reranking. Conflating the two muddles the debate. Before objecting to \"AI search\", it is worth saying which part concerns you, and why.</p><p>The Google Scholar analogy maps most cleanly onto AI-assisted retrieval and reranking: helping surface relevant results that users might otherwise miss. It also maps reasonably well onto \"tip-of-the-tongue\" search, one of the limited uses Bender has acknowledged as potentially useful.</p><p>It maps less directly onto generative answer synthesis, where hallucination risks are sharper. <a href=\"https://aarontay.substack.com/p/what-do-we-actually-mean-by-ai-powered\">I am not arguing that all uses of AI in search carry equal risk</a>. I am arguing that even the riskiest versions clear the \"never useful\" bar. 
The appropriate response to different risk profiles is <a href=\"https://aarontay.substack.com/p/why-use-of-new-ai-enhanced-tools-that\">differentiated teaching, not blanket rejection.</a></p><h2>Back to the analogy</h2><p>The people who insisted in 2004 that Google Scholar could never be useful until Google published full holdings lists were sure they were right. But they were eventually proven wrong in concluding that a tool could never be useful without them.</p><p><em>A tool need not be perfect to be useful</em>. This sounds obvious, but it is the point that keeps getting lost.</p><blockquote><p>One objection is that the Scholar analogy fails because LLM errors are different. Google Scholar had messy metadata and opaque coverage. LLMs produce overconfident hallucinations.</p><p>That objection has force, but notice what it actually supports. It supports teaching verification skills. It supports scaffolding. It supports appropriate scepticism. It does not support the early conclusion that the tool is useless.</p><p>A tool that requires careful handling is not the same as a tool that cannot be useful. The verification argument also cuts both ways. Uncritical acceptance of search results, including Google Scholar results, has always been the failure mode librarians teach against.</p><p>Perhaps the scaffolding still will not be enough. But I am sure many librarians who rejected Google Scholar in 2004 were pretty sure too.</p></blockquote><p>Nor does it help that some of the most vocal sceptics seem not to have engaged seriously with these tools since 2023. They appear to underestimate how much the systems, and the harnesses around them, have changed in just three years. But even that point is secondary. The Scholar parallel does not depend on the exact pace of improvement<a class=\"footnote-anchor\" data-component-name=\"FootnoteAnchorToDOM\" id=\"footnote-anchor-5\" href=\"#footnote-5\" target=\"_self\">5</a>. 
</p><blockquote><p>I currently believe, from some experience setting up and testing agents, that by combining code that constrains the model with aggressive multi-validation checks, you can reduce the probability of errors and hallucinations to very low levels, comparable to a human's.</p></blockquote><p>The lesson is not that tools improve. The lesson is that what counts as good enough is not always obvious at first.</p><h2>Three objections</h2><p>Some librarians argue that the Google Scholar comparison to AI breaks down on three grounds: librarians do not actively promote Scholar (but do actively promote AI); users genuinely want Scholar rather than having it pushed on them; and Scholar is free, whereas many AI tools are commercial products.</p><p>None of these objections does the work required.</p><p>On promotion, many librarians do promote Google Scholar. It appears in LibGuides, instruction sessions and one-on-one consultations. The claim that \"we do not promote it\" is a polite fiction. Beyond what is said publicly, plenty of academic librarians reach for Google, Scholar or Wikipedia first in their own work when the situation calls for it.</p><p>On user demand, users clearly want AI tools. Whatever one thinks of that demand, pretending it does not exist is not a serious position.</p><p>On cost, being free or paid is a separate question from whether a tool can be useful. There are genuine concerns about commercial AI: vendor lock-in, inequity of access, environmental cost, labour implications, surveillance, and the commercial capture of scholarly infrastructure. Those concerns deserve serious engagement.</p><p>But they are arguments about adoption, governance and institutional support. They are not arguments that the tools can never be useful.</p><p>And let us be honest about the comparison. Library databases have plenty of flaws: idiosyncratic interfaces, uneven indexing, opaque relevance ranking, and sometimes weak metadata. 
We still pay substantial sums for them and promote them as a matter of course. The objection to AI tools cannot simply be that they are commercial and imperfect, because by that standard half the collection budget becomes difficult to defend.</p><h2>The \"abusing trust\" argument</h2><p>There is a related claim that deserves a direct response: librarians who teach users how to use AI search tools are abusing professional trust because the tools are imperfect and can lead to errors.</p><p>This is a bad argument dressed up as an ethical one. It rests entirely on the premise that AI search tools are imperfect, as though that distinguished them from anything else we teach.</p><p>Name the flawless tool that we librarians promote.</p><p><em>Our job, properly understood, is to teach people to use imperfect tools well,</em> with appropriate scepticism and a clear understanding of what each tool can and cannot do. Refusing to teach AI search tools because they are flawed is not an act of professional integrity. It is an abdication of the actual job.</p><p>It also leaves users to figure these tools out on their own. That is the worse outcome by every measure.</p><h2>The badly understood \"Stochastic Parrots\" argument</h2><p>A common argument among librarians goes something like this: Emily Bender says LLMs are stochastic parrots; therefore, LLMs can never be useful.</p><p>There is an immediate problem with that argument. Even if you accept the \"stochastic parrot\" description, it does not tell you whether LLMs combined with other technologies can be useful. 
It says nothing directly about retrieval-augmented generation, tool use, calculators, citation validators, structured workflows, human review, or other harnesses wrapped around the model.</p><p>The more damaging problem is that <a href=\"https://medium.com/@emilymenonbender/stochastic-parrots-frequently-unasked-questions-49c2e7d22d11\">Bender herself has clarified that \"stochastic parrot\" is not an argument that LLMs are useless</a>. In her account, it is not even an empirical hypothesis. It is a description or metaphor for systems that generate fluent linguistic form without grounding in communicative intent, a model of the world, or a model of the reader's state of mind<a class=\"footnote-anchor\" data-component-name=\"FootnoteAnchorToDOM\" id=\"footnote-anchor-6\" href=\"#footnote-6\" target=\"_self\">6</a>. </p><p>This does not mean Bender thinks LLMs are broadly useful. <a href=\"https://www.fastmail.com/digitalcitizen/exploring-ai-with-emily-m-bender/\">Her position is far more sceptical than that</a>. She has warned that synthetic text is not an information source, and that <a href=\"https://buttondown.com/maiht3k/archive/information-literacy-and-chatbots-as-search/\">using chatbots as reliable sources of knowledge is a serious category mistake.</a></p><p>But <a href=\"https://www.fastmail.com/digitalcitizen/exploring-ai-with-emily-m-bender/\">she has acknowledged limited possible uses, including \"tip-of-the-tongue\" search, language-learning dialogue partners, non-player characters in games, and non-generative uses of language models in classification, speech recognition and machine translation</a>. She treats summarisation more cautiously, because it can introduce material not present in the source. </p><p>Nor does this mean Bender has retreated from the stronger \"form versus meaning\" argument. 
In <a href=\"https://aclanthology.org/2020.acl-main.463/\">Bender and Koller's 2020 paper</a>, understanding is defined as mapping language to something outside language. Their claim is that a system trained only on linguistic form has no basis for learning that mapping, because it has access only to patterns in text, not to the extra-linguistic world those texts are about.</p><p>That is a serious argument. But it should not be flattened into the much weaker claim that LLMs can never be useful.</p><p>So the better conclusion is not \"stochastic parrots can never be useful\" (though Bender is currently very sceptical). It is: <strong>do not mistake fluent synthetic text for grounded understanding or reliable information</strong>. That is a much narrower, stronger and more useful warning, but it does not settle the question of usefulness.</p><p>It does, however, leave room to ask the question that actually matters for librarians: under what conditions, with what scaffolding, for which tasks, and with what verification, can LLM-based systems be made useful rather than misleading?</p><h2>The lesson</h2><p>The lesson from Google Scholar is not that librarians should embrace every flawed tool users like. It is that \"flawed\" and \"useless\" are not synonyms.</p><p>It is hard to compare like for like, but I think it is fair to say that the practical gains in LLM-powered tools from 2023 to 2026 have been faster and larger than Google Scholar's improvements across its first decade. But the more important point is not the scale of improvement. It is that Google Scholar's improvement did not fundamentally fix its transparency problem. Instead, users learnt that the flaw was either less fatal than it first appeared, or manageable with the right habits.</p><p>That is the lesson librarians need to take seriously now.</p><p>If the objection is environmental cost, make the environmental argument. If it is labour exploitation, make the labour argument. 
If it is vendor lock-in, inequity, surveillance, weak governance, or commercial capture, make those arguments. They are serious enough to stand on their own.</p><p>They do not need a backdoor return to the claim that AI tools cannot really be useful.</p><p>That move is increasingly unconvincing. It usually begins with a reluctant concession: of course AI can sometimes be useful. Then, when the discussion turns to teaching, adoption or institutional support, the old premise quietly reappears. The tools are flawed, so using them must be irresponsible.</p><p>But that is not how librarians treat tools.</p><p>We teach flawed systems all the time. We teach Google Scholar while warning about coverage and metadata. We teach Scopus and Web of Science while explaining their selectivity. We teach discovery layers while knowing their indexing and ranking are imperfect.</p><p>The professional act is not pretending tools are flawless. It is teaching people where they help, where they fail, and how to verify what matters.</p><p>So reject an AI tool because the cost is too high, the governance is too weak, the evidence is too thin, or the institutional incentives are wrong.</p><p>Just say that.</p><p>Do not dress those objections up as proof that the tool can never be useful. That argument has already lost.</p><p>The question now is not whether AI search tools can be useful. 
It is which uses are worth the cost, which are not, and what role librarians should play in helping users tell the difference.</p><div class=\"footnote\" data-component-name=\"FootnoteToDOM\"><a id=\"footnote-1\" href=\"#footnote-anchor-1\" class=\"footnote-number\" contenteditable=\"false\" target=\"_self\">1</a><div class=\"footnote-content\"><p>There are many similarities to the Innovator's Dilemma argument. Early users may value dimensions that experts discount, and a tool that performs poorly against established professional criteria may still become useful enough to reshape practice. 
But unlike in the Innovator's Dilemma, the cases I refer to are ones where the alternatives co-exist.</p></div></div><div class=\"footnote\" data-component-name=\"FootnoteToDOM\"><a id=\"footnote-2\" href=\"#footnote-anchor-2\" class=\"footnote-number\" contenteditable=\"false\" target=\"_self\">2</a><div class=\"footnote-content\"><p>Google Scholar offered no stable, auditable list of indexed sources or records, and even <a href=\"https://scholar.google.com/scholar/help.html#coverage\">Google's own guidance effectively tells users to test coverage by sampling titles rather than consult a definitive inventory.</a></p></div></div><div class=\"footnote\" data-component-name=\"FootnoteToDOM\"><a id=\"footnote-3\" href=\"#footnote-anchor-3\" class=\"footnote-number\" contenteditable=\"false\" target=\"_self\">3</a><div class=\"footnote-content\"><p>As confirmed by many studies. The amount of full-text indexed by Google Scholar is also believed by many to be unmatched.</p></div></div><div class=\"footnote\" data-component-name=\"FootnoteToDOM\"><a id=\"footnote-4\" href=\"#footnote-anchor-4\" class=\"footnote-number\" contenteditable=\"false\" target=\"_self\">4</a><div class=\"footnote-content\"><p>To be clear, Bender is not being cited here as making the same claim as Marcus or Mitchell that LLMs are broadly useful. <a href=\"https://www.fastmail.com/digitalcitizen/exploring-ai-with-emily-m-bender/\">Her position is much narrower and more sceptical.</a> In <a href=\"https://www.fastmail.com/digitalcitizen/exploring-ai-with-emily-m-bender/\">interviews</a>, she has said that safe and beneficial uses of synthetic text are hard to identify, but she has offered tentative examples, <em>especially \"tip of the tongue\" search,</em> where a user describes something in order to recover the name of it and can then verify it through ordinary search. 
She also distinguishes text generation from other uses of language models, saying that language models can have positive uses in classification, automatic speech recognition, and machine translation, while treating summarisation as more borderline because it can introduce material not present in the source. The point here is therefore not that Bender endorses LLMs as broadly useful, but that \"stochastic parrot\", in her own account, is not an empirical hypothesis or an argument that LLMs have no possible utility. It is a description or metaphor for language-mimicking systems, and the 2021 paper was about the risks and harms of pursuing ever-larger language models, not a general paper about \"AI\".</p></div></div><div class=\"footnote\" data-component-name=\"FootnoteToDOM\"><a id=\"footnote-5\" href=\"#footnote-anchor-5\" class=\"footnote-number\" contenteditable=\"false\" target=\"_self\">5</a><div class=\"footnote-content\"><p>The improvement of LLMs from 2022 to 2026 is far larger than Google Scholar's improvement from 2004 to 2024! This improvement comes both from improvements in the models and from the use of harnesses like Claude Code to combine deterministic code with LLM flexibility.</p></div></div><div class=\"footnote\" data-component-name=\"FootnoteToDOM\"><a id=\"footnote-6\" href=\"#footnote-anchor-6\" class=\"footnote-number\" contenteditable=\"false\" target=\"_self\">6</a><div class=\"footnote-content\"><p>To be clear, this does not mean Bender has retreated from the stronger \"form versus meaning\" argument. In her account, the closest thing to an argument in this area is <a href=\"https://aclanthology.org/2020.acl-main.463/\">Bender and Koller's 2020 paper</a>, which defines understanding as mapping language to something outside language. Their claim is that a system trained only on linguistic form has no basis for learning that mapping, because it only has access to patterns in text, not to the extra-linguistic world those texts are about. 
This is separate from the \"stochastic parrots\" phrase itself, which Bender describes as a metaphor rather than an empirical hypothesis. She also notes that multimodal systems complicate the picture: image-text models may meet the Bender and Koller definition of understanding in a very thin sense, because they can map between linguistic strings and images. But she argues that the stochastic-parrot framing remains relevant to such systems and to systems built around them.</p><p></p></div></div>","doi":"https://doi.org/10.59350/xjc74-4s752","guid":"200014373","image":"https://substackcdn.com/image/fetch/$s_!ciEX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1781481600,"rid":"a0bcx-c8490","summary":"What 2004 can teach us about 2024 \u2014 and the librarians who keep getting the lesson wrong","tags":["Llm","Ai Search"],"title":"Learning from Google Scholar and why a tool does not need to be flawless to be useful","updated_at":1781542041,"url":"https://aarontay.substack.com/p/learning-from-google-scholar-and","version":"v1"}},{"document":{"authors":[{"contributor_roles":[],"family":"Dingemanse","given":"Mark"}],"blog":{"authors":null,"community_id":"ac7a6214-f166-416e-9500-caa8343d6285","created":1780876800,"current_feed_url":null,"description":"Sounding out ideas on language, interaction, and 
iconicity","favicon":"https://rogue-scholar.org/api/communities/ac7a6214-f166-416e-9500-caa8343d6285/logo","feed_format":"application/atom+xml","feed_url":"https://ideophone.org/feed/atom/","filter":"category:98","generator":"WordPress","home_page_url":"https://ideophone.org","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"ideophone","status":"active","subfield":"1203","title":"The Ideophone","updated":1781539498,"use_api":true},"blog_name":"The Ideophone","blog_slug":"ideophone","content_html":"<p>Note to readers: some of these ideas made it into a commentary I wrote with Christine Cuskley:</p>\n\n\n\n<p class=\"has-background\" style=\"background-color:#bcd9e670\">Dingemanse, Mark &amp; Cuskley, Christine (in press). For robust research, center values, not technology. <em>Behavioral and Brain Sciences</em>. Preprint doi: <a href=\"https://doi.org/10.5281/zenodo.18944023\" target=\"_blank\" rel=\"noreferrer noopener\">10.5281/zenodo.18944023</a></p>\n\n\n\n<p>One topic that often comes up when discussing <a href=\"https://ideophone.org/generative-ai-and-research-integrity/\" data-type=\"post\" data-id=\"8271\">LLM technology in relation to research integrity</a> is one that I will describe as <em>seeking permission</em>. When looking at the ethical, legal, and societal harms imposed by LLMs (<a href=\"https://hcommons.org/?get_group_doc=1005140/1757881623-Guest_etal_2025.pdf\">and there are many</a>), sometimes people feel the message ends up altogether too negative. How about this use case I heard of? Aren&#8217;t some people getting something useful out of it? Surely we can&#8217;t ban the tech outright?</p>\n\n\n\n<p>Often this is phrased as a concern about messaging (people won&#8217;t accept it if we tell them how bad it is; you need to sugarcoat it by also mentioning something nice). 
Sometimes it is phrased as a majority argument: it&#8217;s already here, everyone is using it, surely it can&#8217;t be that bad? (Smoking would like a word.) Sometimes it is a concern about missing the boat: these are skills we need, telling people not to use it is like telling them to go back to quill pens. <sup><a href=\"#footnote_1_8781\" id=\"identifier_1_8781\" class=\"footnote-link footnote-identifier-link\" title=\"Quick rhetorical intervention here: I don&rsquo;t hide my distaste of Big Tech LLMs, but I rarely suggest to ban them, so when I get this quip I typically bounce it back: what made you think I suggested that? There&rsquo;s a good conversation to be had here, but it is not about quill pens. It is about the moral distress that a response like this reveals.\">1</a></sup></p>\n\n\n\n<p>What I think is going on in these kinds of cases is that the central thrust of the argument is being missed. <strong>When it comes to research integrity, the key is a values-first perspective rather than a tech-first perspective.</strong></p>\n\n\n\n<p>A values-first perspective asks: how can we best uphold the values and standards that make our research robust, reproducible and future proof? A tech-first perspective asks: yeah, but how can I use this technology? It puts technology above values. It seeks permission but sidesteps the question of values.</p>\n\n\n\n<h2 class=\"heading\" class=\"wp-block-heading\">Can I use an LLM for &#8230;</h2>\n\n\n\n<p>One example that came up in a <a href=\"https://ideophone.org/on-generative-ai-and-reproducibility/\" data-type=\"post\" data-id=\"8740\">recent session</a> started with <em>literature review</em>: surely it&#8217;s not too harmful, a questioner said, to use an LLM for a first stab at a literature review? Multiple participants pushed back on this, saying that actually, LLMs don&#8217;t reliably summarise. Also, they provide only the most average consensus view; we don&#8217;t know what&#8217;s being left out. 
And LLMs by nature regurgitate without understanding; can we actually identify confidently produced bullshit in a field that we don&#8217;t master fully? Further, reading is a hard-won skill: tracing arguments, spending time with papers, separating substance and rhetoric; surely we don&#8217;t want to lose this skillset.</p>\n\n\n\n<p>A retreat followed: well then, maybe literature review was not the best example, but surely there is some other kind of use that is okay? Someone mentioned programming. Folks brought up deskilling, the security risks of vibe-coding, technical debt. Oh well, that&#8217;s not the kind of programming I meant. And so on.</p>\n\n\n\n<p>We can play this game all day: <a href=\"https://anthonymoser.github.io/writing/ai/haterdom/2025/08/26/i-am-an-ai-hater.html\">seek permission</a> for an edge case, retreat to another case when trouble arises, appeal to amiability or tolerance. Two things to note when a conversation gets to this point. First, it&#8217;s a misrepresentation: reluctance to grant permission is portrayed as an intolerant move, a wish to take the other&#8217;s toys, when it is nothing like that. What do you need my permission for, anyway? Second, it&#8217;s a distraction: it moves the conversation away from values and towards a tech-first perspective.</p>\n\n\n\n<h2 class=\"heading\" class=\"wp-block-heading\">Values: a matrix for mindful choice</h2>\n\n\n\n<p>This is where a values-first perspective can help cut through the knots. The approach is not to recommend or forbid particular tech products, or particular uses; it is to provide a matrix for mindful choice. For any attested or conceivable use, you can ask: how would this help or hinder my upholding of high standards of research integrity? 
If we consider the pushback voiced in response to the &#8216;literature review&#8217; use case, you can see they appeal to the same core values:</p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>honesty</em> (reporting processes accurately and being open about margins of uncertainty)</li>\n\n\n\n<li><em>scrupulousness</em> (being precise and thoughtful)</li>\n\n\n\n<li><em>transparency </em>(showing one's process and allowing others to build on it)</li>\n\n\n\n<li><em>independence</em> (being impartial and unswayed by commercial or political interests) </li>\n\n\n\n<li><em>responsibility</em> (taking into account the environment; accepting accountability for the statements made)</li>\n</ul>\n\n\n\n<p>These values do not come out of thin air; they&#8217;re straight from a widely adopted code of conduct for research integrity (NCCRI, 2018). They are also not controversial; most scientists will recognise them as principles that characterize robust research. I&#8217;ve written about them before, e.g. in my <a href=\"https://ideophone.org/generative-ai-and-research-integrity/\" data-type=\"post\" data-id=\"8271\">guidance on GenAI</a> and in my post on why <a href=\"https://ideophone.org/why-synthetic-text-is-incompatible-with-science-blogging/\" data-type=\"post\" data-id=\"8516\">synthetic text has no place in science blogging</a>.</p>\n\n\n\n<p>If you use these values as a compass to steer by, it&#8217;s easier to navigate the landscape of technology use. On the other hand, if you find yourself seeking permission, one useful thing to do is to step back and inspect the underlying value conflict. </p>\n\n\n\n<p>As you move from a tech-first to a values-first perspective, the question shifts from &#8220;won&#8217;t you give me permission?&#8221; to &#8220;how do I do the best science possible?&#8221;. 
And that, to me, is a question worth asking.</p>\n\n\n\n<h2 class=\"heading\" class=\"wp-block-heading\">Further reading</h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https://anthonymoser.github.io/writing/ai/haterdom/2025/08/26/i-am-an-ai-hater.html\">I am an AI Hater</a>, by Anthony Moser</li>\n\n\n\n<li><a href=\"https://zenodo.org/records/17065099\">Against the Uncritical Adoption of &#8216;AI&#8217; technologies in academia</a>, by Olivia Guest and collaborators</li>\n\n\n\n<li><a href=\"https://www.nwo.nl/en/netherlands-code-of-conduct-for-research-integrity\">Netherlands Code of Conduct for Research Integrity, 2018</a></li>\n</ul>\n\n\n\n<p></p>\n<ol class=\"footnotes\"><li id=\"footnote_1_8781\" class=\"footnote\">Quick rhetorical intervention here: I don&#8217;t hide my distaste of Big Tech LLMs, but I rarely suggest to ban them, so when I get this quip I typically bounce it back: what made you think I suggested that? There&#8217;s a good conversation to be had here, but it is not about quill pens. It is about the moral distress that a response like this reveals.<span class=\"footnote-back-link-wrapper\">[<a href=\"#identifier_1_8781\" class=\"footnote-link footnote-back-link\">&#8617;</a>]</span></li></ol>","doi":"https://doi.org/10.59350/cwfrp-k7f11","guid":"https://ideophone.org/?p=8781","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1765584000,"rid":"1tyst-nr357","summary":"Note to readers: some of these ideas made it into a commentary I wrote with Christine Cuskley: Dingemanse, Mark &amp; Cuskley, Christine (in press). 
For robust research, center values, not technology.","tags":["Academia","Most Read","Writing","Generative AI"],"title":"Don't seek permission, center values","updated_at":1781541999,"url":"https://ideophone.org/dont-seek-permission-center-values/","version":"v1"}},{"document":{"authors":[{"contributor_roles":[],"family":"Tapley Hoyt","given":"Charles","url":"https://orcid.org/0000-0003-4423-4370"}],"blog":{"authors":null,"community_id":"da4ef2af-5fad-46b2-8195-d77db0141ad6","created":1716422400,"current_feed_url":null,"description":"Unraveling complex biology with biological knowledge graphs. Content licensed under CC BY 4.0.","favicon":"https://rogue-scholar.org/api/communities/da4ef2af-5fad-46b2-8195-d77db0141ad6/logo","feed_format":"application/atom+xml","feed_url":"https://cthoyt.com/feed.xml","filter":null,"generator":"Jekyll","home_page_url":"https://cthoyt.com/","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"cthoyt","status":"active","subfield":"1312","title":"Biopragmatics","updated":1781539024,"use_api":null},"blog_name":"Biopragmatics","blog_slug":"cthoyt","content_html":"<p>In language, semantics describe the names and meanings of words. The\nbioinformatics community has aptly adopted <em>biosemantics</em> as a concept that\nencompasses the issues with the names and meanings of biological entities,\nusually in natural language processing and data integration. However, semantics\ndoes not capture the context of words, and <em>biosemantics</em> fails to describe the\nbiological context and complex relationships between biological entities.</p>\n<p><img alt=\"Semantics versus Pragmatics\" height=\"300px\" src=\"https://pediaa.com/wp-content/uploads/2018/08/Difference-Between-Semantics-and-Pragmatics_Figure-1.png\"/></p>\n<p>Pragmatics goes beyond semantics and describes the context of words. 
Because of\nthis parallelism, I've begun to use the term <em>biopragmatics</em> to describe the\nfamily of computational approaches aimed at identifying and contextualizing the\ncontext of biological entities.</p>","doi":"https://doi.org/10.59350/j4vty-rrr02","guid":"https://cthoyt.com/2020/01/22/biosemantics-versus-biopragmatics","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1579651200,"rid":"6qadr-jvy42","summary":"In language, semantics describe the names and meanings of words. The bioinformatics community has aptly adopted biosemantics as a concept that encompasses the issues with the names and meanings of biological entities, usually in natural language processing and data integration. However, semantics does not capture the context of words, and biosemantics fails to describe the biological context and complex relationships between biological entities.","tags":["Semantics","Meta"],"title":"Biosemantics vs. Biopragmatics","updated_at":1781539932,"url":"https://cthoyt.com/2020/01/22/biosemantics-versus-biopragmatics.html","version":"v1"}},{"document":{"authors":[{"contributor_roles":[],"family":"Tapley Hoyt","given":"Charles","url":"https://orcid.org/0000-0003-4423-4370"}],"blog":{"authors":null,"community_id":"da4ef2af-5fad-46b2-8195-d77db0141ad6","created":1716422400,"current_feed_url":null,"description":"Unraveling complex biology with biological knowledge graphs. 
Content licensed under CC BY 4.0.","favicon":"https://rogue-scholar.org/api/communities/da4ef2af-5fad-46b2-8195-d77db0141ad6/logo","feed_format":"application/atom+xml","feed_url":"https://cthoyt.com/feed.xml","filter":null,"generator":"Jekyll","home_page_url":"https://cthoyt.com/","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"cthoyt","status":"active","subfield":"1312","title":"Biopragmatics","updated":1781539024,"use_api":null},"blog_name":"Biopragmatics","blog_slug":"cthoyt","content_html":"<p>How many molecular biology papers have you read today? This week? This month? If\nyou're like me, it's not so many, and we're falling behind very quickly. Here's a\nchart made by the <em>new</em> PubMed that summarizes how many papers were published\nmentioning RAS in the last 50 years.</p>\n<p><img alt=\"RAS Histogram\" src=\"https://cthoyt.com/img/ras_pubmed_history.png\"/></p>\n<p>There were 4,483 publications listed in 2019. We can't read that much, and even\nif we did, we couldn't remember it all. That's why we need to take the knowledge\nout of the unstructured text and store it in a structured form that can be read\nand stored in computers. This way, we can easily share it, query it, and write\nalgorithms that can help us reason about the incredible amount of biological\nknowledge out there.</p>\n<p>There are several formats in which this kind of information can be stored on a\ncontinuum from directly representing mechanistic biology to representing the\nknowledge itself. 
In the popular middle ground are BioPAX and BEL, which I'll\ncome back to in future posts.</p>\n<p>It's important to keep in mind that knowledge needs to be curated - this can\neither be manual, through natural language processing, or a mixture of both.\nI've written\n<a href=\"https://academic.oup.com/database/article/doi/10.1093/database/baz068/5521414\">a paper</a>\non such a process, but for now this post should motivate a few following ones\ndescribing what it takes to deal with nomenclature, build ontologies, and then\nstart extracting mechanistic biology from the literature.</p>","doi":"https://doi.org/10.59350/jt101-b1374","guid":"https://cthoyt.com/2020/01/23/encoding-biology-in-kgs","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1579737600,"rid":"qyhqj-cs605","summary":"How many molecular biology papers have you read today? This week? This month? If you're like me, its not so many, and we're falling behind very quickly. Here's a chart made by the new PubMed that summarizes how many papers were published mentioning RAS in the last 50 years.","tags":["Knowledge Graphs"],"title":"Encoding Biology in Knowledge Graphs","updated_at":1781539915,"url":"https://cthoyt.com/2020/01/23/encoding-biology-in-kgs.html","version":"v1"}},{"document":{"authors":[{"contributor_roles":[],"family":"Tapley Hoyt","given":"Charles","url":"https://orcid.org/0000-0003-4423-4370"}],"blog":{"authors":null,"community_id":"da4ef2af-5fad-46b2-8195-d77db0141ad6","created":1716422400,"current_feed_url":null,"description":"Unraveling complex biology with biological knowledge graphs. 
Content licensed under CC BY 4.0.","favicon":"https://rogue-scholar.org/api/communities/da4ef2af-5fad-46b2-8195-d77db0141ad6/logo","feed_format":"application/atom+xml","feed_url":"https://cthoyt.com/feed.xml","filter":null,"generator":"Jekyll","home_page_url":"https://cthoyt.com/","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"cthoyt","status":"active","subfield":"1312","title":"Biopragmatics","updated":1781539024,"use_api":null},"blog_name":"Biopragmatics","blog_slug":"cthoyt","content_html":"<p>The other day I saw a tweet lamenting the drag that is literature review during\npreparation for writing your thesis.</p>\n<blockquote class=\"twitter-tweet\" data-partner=\"tweetdeck\"><p dir=\"ltr\" lang=\"en\">I just love writing 15 page literature reviews for graduate school courses on literally any topic except my thesis topic.</p>\u2014 PhD Diaries (@thoughtsofaphd) <a href=\"https://twitter.com/thoughtsofaphd/status/1225762592045649920?ref_src=twsrc%5Etfw\">February 7, 2020</a></blockquote>\n<p>I agree. I felt the same pain last fall when I wrote\n<a href=\"https://github.com/cthoyt/doctoral-thesis\">my doctoral thesis</a>. Luckily, I had\na strategy that made it a bit easier.</p>\n<p>I learned it from one of my professors when I was doing my master's degree in\nLife Science Informatics. Each semester, we had a seminar course in which each\nstudent was assigned research articles to read and present to the class with a\nshort slide deck. Later, I joined his research group and realized that this\ncourse served as a literature review for him just as much as us.</p>\n<p>So later when I was a Ph.D. student, I volunteered to run the seminar. I\nco-opted the concept, and planned the course to cover many of the topics I found\ninteresting for my thesis. I already knew some of the papers very well, and a\nfew were ones I had always been meaning to read. 
I tried to pick the most recent\npapers for topics when possible, but also threw in a few classics.</p>\n<p>On the first day of the seminar, I shared the following course information. I\nthought it was important to make clear what my expectations were for students in\nterms of their prior knowledge. Since they all came from the same master's\nprogram, I thought it was enough that they had passed one of the first semester\nlectures called \"Biological Databases\" which was about many of the resources and\ndatabases used in the systems and networks biology community. I also outlined\nthe content of the course, what was expected, etc., then shared this\nall as a Google Doc so they could read it over and add comments.</p>\n<p>I also made a list of possible papers and a tentative schedule that students\ncould look over and decide which papers they found most interesting. The topics\nwere arranged in a logical order to tell the story of my thesis, and for each\nsection there were a few papers that I thought were very important, and a few\nextras just in case there was a lot of interest. During the first day of the\nseminar, I also went through the list of all papers and explained the topics to\nthe students. I gave them this list via Google Docs as well, and they were able\nto claim papers for their presentations. Below, I've listed the final list of\npapers and the order in which they were presented. We were able to come to\nagreements for all students to present the papers I found most important. Maybe\n40% of the class found a paper interesting and picked one the first day and the\nrest took the next week to decide, ask questions, or propose new papers.</p>\n<p>Another consideration I had when picking this paper list was to choose work done\nby my colleagues that I found interesting and helpful. I then invited them to\ncome to the seminar and mediate the discussion afterwards. We were able to invite
We were able to invite\none of my collaborators Mehdi Ali (he's a really good guy!) to discuss his work\non using deep learning for relation extraction in natural language processing. I\nthink that might have been the most engaging day of the whole series.</p>\n<p>I added one aspect to this course compared to the previous seminar that I had\nattended: each student was not only responsible for presenting the paper that\nhad been assigned from my list, but they were also responsible for finding a\nrelevant pre-print (in the same or similar topic) and submitting a peer review\nthrough the pre-print system. When I was a student, I noticed many students did\nnot read the references of the paper they were assigned in our seminars, and\nalso had not considered other similar research to their paper. Asking them to\nfind their own papers was a way to make this a more creative and fun process,\nand would directly prepare them to answer questions at the end of the\npresentation like \"what will the authors do next?\" or \"how will this research be\nused by others?\"</p>\n<p>One of the funny things that happened during the pre-print presentations is the\nstudents found several of mine and presented those. I suppose this was\ninevitable given the contemporary nature of my work in the context of the topics\nchosen. I would actually explicitly encourage students to check out my\npre-prints the next time I host a seminar, because I know the work very well and\ncould mediate a nice discussion.</p>\n<p>I learned a lot through the process of preparing this seminar. Its outline\nbecame the outline for my thesis, and a lot of the discussions became points\nthat I addressed explicitly in my writing. I wouldn't say that I was taking\nadvantage of the students in this process - we all benefited from the\nexperience. 
I hope you get some ideas about how you might be able to do this\nyourself, whether you're a doctoral student, a postdoc, or something else!</p>\n<h2 id=\"course-information\">Course Information</h2>\n<ul>\n<li>Title: Knowledge Assembly, Data Integration, and Modeling in Systems and\nNetworks Biology</li>\n<li>Period: Winter Semester 2018/2019</li>\n<li>Location: Endenicher Allee 19A, Room U.105 on Wednesdays 13.00-14.30</li>\n</ul>\n<h3 id=\"qualifications\">Qualifications</h3>\n<p>Students should be comfortable with the material presented in the Biological\nDatabases lecture during the first semester of the LSI curriculum.</p>\n<h3 id=\"goal\">Goal</h3>\n<p>Students will have the opportunity to practice reading, presenting, and\ndiscussing recent biomedical literature on the topics of knowledge assembly,\ndata integration, and modeling in systems and networks biology.</p>\n<h3 id=\"content\">Content</h3>\n<p>Students will be assigned papers and present on the holistic process of\nknowledge discovery in systems and networks biology that focus on the topics of\nknowledge assembly (e.g., natural language processing, modeling formalisms and\nformats, reasoning techniques), data integration (e.g., practical scenarios\nfocusing on techniques on the data level, knowledge level, and analytical\nlevels), and modeling strategies (e.g., rule-based modeling, agent-based\nmodeling, mathematical modeling, hypothesis generation with knowledge-based\napproaches).</p>\n<h3 id=\"assignment\">Assignment</h3>\n<p>Students will be assigned an article to read and present during a thirty (30)\nminute lecture. One goal of this lecture is to show an understanding of not only\nthe material presented in the article, but also the relevant background\ninformation - this may entail following the references and reading other\narticles. 
Another goal is to not only educate, but entertain the audience.\nStudents will also be expected to find a relevant pre-print article on arXiv,\nbioRxiv, or other pre-print server and post a peer-review for the author on the\ncorresponding service. Following the presentation of their assigned article,\nstudents should include slides (1-3) briefly explaining the relevance of the\npre-print that they found.</p>\n<h2 id=\"method-of-performance-review\">Method of Performance Review</h2>\n<p>Students will be assessed on their understanding of their assigned topic, the\nquality of their presentation, and their participation. Students missing more\nthan 2 seminars will not pass the course without a doctor's note.</p>\n<h2 id=\"schedule\">Schedule</h2>\n<h3 id=\"week-0---october-10th-2018---syllabus-week\">Week 0 - October 10th, 2018 - Syllabus Week</h3>\n<p>This week there will be a short discussion of the syllabus and no presentation. For\nthose in Bonn that aren't aware of this wonderful tradition, welcome to Syllabus\nWeek.</p>\n<h3 id=\"week-1---october-31st-2018---named-entity-recognition\">Week 1 - October 31st, 2018 - Named Entity Recognition</h3>\n<p>Mubassher Leser, U., &amp; Hakenberg, J. (2005).\n<a href=\"https://doi.org/10.1093/bib/6.4.357\">What makes a gene name? Named entity recognition in the biomedical literature</a>.\nBriefings in Bioinformatics, 6(4), 357\u2013369.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2017/03/08/115022</p>\n<p>Bachman, J. A., Gyori, B. M., &amp; Sorger, P. K. 
(2018).\n<a href=\"https://doi.org/10.1186/s12859-018-2211-5\">FamPlex: A resource for entity recognition and relationship resolution of human protein families and complexes in biomedical text mining</a>.\n<em>BMC Bioinformatics</em>, 19(1), 1\u201314.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/07/29/379446</p>\n<h3 id=\"week-2---november-7th-2018---identifiers\">Week 2 - November 7th, 2018 - Identifiers</h3>\n<p>Laibe, C., &amp; Le Nov\u00e8re, N. (2007).\n<a href=\"https://doi.org/10.1186/1752-0509-1-58\">MIRIAM Resources: tools to generate and resolve robust cross-references in Systems Biology</a>.\n<em>BMC Systems Biology</em>, 1, 58.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2017/07/24/167619</p>\n<p>Juty, N., Le Nov\u00e8re, N., &amp; Laibe, C. (2012).\n<a href=\"https://doi.org/10.1093/nar/gkr1097\">Identifiers.org and MIRIAM Registry: Community resources to provide persistent identification</a>.\n<em>Nucleic Acids Research</em>, 40(D1), 580\u2013586.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/02/14/101279</p>\n<h3 id=\"week-3---november-14th-2018---information-extraction\">Week 3 - November 14th, 2018 - Information Extraction</h3>\n<p>Novichkova, S., <em>et al.</em> (2003).\n<a href=\"https://doi.org/10.1093/bioinformatics/btg207\">MedScan, a natural language processing engine for MEDLINE abstracts</a>.\n<em>Bioinformatics</em>, 19(13), 1699\u20131706.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/08/29/403667</p>\n<p>Ali, M., <em>et al.</em> (2017).\n<a href=\"http://publica.fraunhofer.de/eprints/urn_nbn_de_0011-n-4972978.pdf\">Automatic Extraction of BEL-Statements based on Neural Networks</a>.\n<em>Proceedings of BioCreative VI Challenge and Workshop</em>, (October).</p>\n<p>Pre-print: https://osf.io/j76y3/</p>\n<h3 id=\"week-4---november-21nd-2018---knowledge-representations\">Week 4 - November 21st, 2018 - Knowledge Representations</h3>\n<p>Demir, E., <em>et al.</em> (2010).\n<a 
href=\"https://doi.org/10.1038/nbt1210-1308c\">The BioPAX community standard for pathway data sharing</a>.\n<em>Nature Biotechnology</em>, 28(12), 1308\u20131308.</p>\n<p>Pre-print: https://www.biorxiv.org/content/10.1101/192856v1</p>\n<p>Hucka, M., <em>et al.</em> (2003).\n<a href=\"http://www.ncbi.nlm.nih.gov/pubmed/12611808\">The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models</a>.\n<em>Bioinformatics (Oxford, England)</em>, 19(4), 524\u201331.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/01/23/246470</p>\n<h3 id=\"week-5---november-28th---knowledge-representations-cont\">Week 5 - November 28th - Knowledge Representations (cont\u2026)</h3>\n<p>Le Nov\u00e8re, <em>et al.</em> (2009).\n<a href=\"https://doi.org/10.1038/nbt.1558\">The Systems Biology Graphical Notation</a>.\n<em>Nature Biotechnology</em>, 27(8), 735\u201341.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/01/30/256750</p>\n<p>Carbon, S., <em>et al.</em> (2017).\n<a href=\"https://doi.org/10.1093/nar/gkw1108\">Expansion of the gene ontology knowledgebase and resources: The gene ontology consortium</a>.\n<em>Nucleic Acids Research</em>, 45(D1), D331\u2013D338.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/10/07/437020</p>\n<h3 id=\"week-6---december-12th-2018---pathway-databases-and-semantic-data-integration\">Week 6 - December 12th, 2018 - Pathway Databases and Semantic Data Integration</h3>\n<p>Croft, D., <em>et al.</em> (2014).\n<a href=\"https://doi.org/10.1093/nar/gkt1102\">The Reactome pathway knowledgebase</a>.\n<em>Nucleic Acids Research</em>, 42(D1), D472\u2013D477. <strong>AND</strong> Fabregat, A., <em>et al.</em>\n(2018).\n<a href=\"https://doi.org/10.1093/nar/gkx1132\">The Reactome Pathway Knowledgebase</a>.\n<em>Nucleic Acids Research</em>, 46(D1), D649\u2013D655.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/10/09/375097</p>\n<p>Cerami, E. 
G., <em>et al.</em> (2011).\n<a href=\"https://doi.org/10.1093/nar/gkq1039\">Pathway Commons, a web resource for biological pathway data</a>.\n<em>Nucleic Acids Research</em>, 39(SUPPL. 1), 685\u2013690.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/09/03/353235</p>\n<p>Khatri, P., Sirota, M., &amp; Butte, A. J. (2012).\n<a href=\"https://doi.org/10.1371/journal.pcbi.1002375\">Ten years of pathway analysis: Current approaches and outstanding challenges</a>.\n<em>PLoS Computational Biology</em>, 8(2).</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/09/13/416131</p>\n<p>Gligorijevi\u0107, V., &amp; Pr\u017eulj, N. (2015).\n<a href=\"https://doi.org/10.1098/rsif.2015.0571\">Methods for biological data integration: perspectives and challenges</a>.\n<em>Journal of The Royal Society Interface</em>, 12(112), 20150571.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/12/13/358390</p>\n<h3 id=\"week-8---january-16th-2019---applications\">Week 8 - January 16th, 2019 - Applications</h3>\n<p>Saqi, M., <em>et al.</em> (2018).\n<a href=\"https://doi.org/10.1093/bib/bby025\">Navigating the disease landscape: knowledge representations for contextualizing molecular signatures</a>.\n<em>Briefings In Bioinformatics</em>, (May), 1\u201315.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/11/23/475202</p>\n<p>Himmelstein, D. S., <em>et al.</em> (2017).\n<a href=\"https://doi.org/10.7554/eLife.26726\">Systematic integration of biomedical knowledge prioritizes drugs for repurposing</a>.\n<em>ELife</em>, 6.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/10/13/442640</p>\n<h3 id=\"week-9---january-23rd-2019---applications\">Week 9 - January 23rd, 2019 - Applications</h3>\n<p>Lopez, C. 
F., <em>et al.</em> (2013).\n<a href=\"https://doi.org/10.1038/msb.2013.1\">Programming biological models in Python using PySB</a>.\n<em>Molecular Systems Biology</em>, 9(646), 646.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/12/23/503359</p>\n<p>Gyori, B. M., <em>et al.</em> (2017).\n<a href=\"https://doi.org/10.15252/msb.20177651\">From word models to executable models of signaling networks using automated assembly</a>.\n<em>Molecular Systems Biology, 13(11)</em>, 954.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/05/15/322156</p>","doi":"https://doi.org/10.59350/4t3sf-mab09","guid":"https://cthoyt.com/2020/02/09/seminar-for-thesis-writing","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1581206400,"rid":"9h7sp-08z86","summary":"The other day I saw a tweet lamenting the drag that is literature review during preparation for writing your thesis.","tags":["Doctoral Thesis","Teaching"],"title":"Host a Graduate Seminar Before Writing Your Thesis","updated_at":1781539913,"url":"https://cthoyt.com/2020/02/09/seminar-for-thesis-writing.html","version":"v1"}},{"document":{"authors":[{"contributor_roles":[],"family":"Tapley Hoyt","given":"Charles","url":"https://orcid.org/0000-0003-4423-4370"}],"blog":{"authors":null,"community_id":"da4ef2af-5fad-46b2-8195-d77db0141ad6","created":1716422400,"current_feed_url":null,"description":"Unraveling complex biology with biological knowledge graphs. 
Content licensed under CC BY 4.0.","favicon":"https://rogue-scholar.org/api/communities/da4ef2af-5fad-46b2-8195-d77db0141ad6/logo","feed_format":"application/atom+xml","feed_url":"https://cthoyt.com/feed.xml","filter":null,"generator":"Jekyll","home_page_url":"https://cthoyt.com/","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"cthoyt","status":"active","subfield":"1312","title":"Biopragmatics","updated":1781539024,"use_api":null},"blog_name":"Biopragmatics","blog_slug":"cthoyt","content_html":"<p>We've all been there. You started a new branch from master. You had a very\nspecific goal in mind, <strong>The Original Goal</strong>. You made a pull request (PR) to go\nwith it, too, <strong>The Original Pull Request</strong>. But then, you had an idea! And\nalso, someone on your team asked you to solve another problem! Now the original\ncode you wrote to address <strong>The Original Goal</strong> relies on that code \u2026 and now\nyou've got dozens of files changed, hundreds of lines of diff, and nobody\n(including you) can understand what you've done. Like I said, we've all been\nthere. Here's what you can do to fix it:</p>\n<h2 id=\"1-stop-and-relax\">1. Stop and Relax</h2>\n<p>Don't do anything rash. Git is a pain to use, and you're going to have to rely\non it to keep a history for you of what you've done.</p>\n<h2 id=\"2-summarize\">2. Summarize</h2>\n<p>First, you're going to have to take a big step back. Write a summary of all the\nthings you've done in <strong>The Original Pull Request</strong>. This should be about <em>what</em>\nthe PR does and <em>why</em> it does it. 
Of course it could vary depending on the\nsituation, but this summary shouldn't be about exactly how the PR does it,\nbecause the implementation details are likely what led to this situation in the\nfirst place.</p>\n<p>Keep in mind that every PR has a box at the top that's used to describe what's\nin it. This is where you will put your summary.</p>\n<h2 id=\"3-assessing-dependencies\">3. Assessing Dependencies</h2>\n<p>Of all the things that <strong>The Original Pull Request</strong> does, some are\nself-contained, and some rely on each other. It was probably the case\nthat to accomplish <strong>The Original Goal</strong>, you had to address lots of smaller\ngoals. You probably also had to change lots of code and write new code too.</p>\n<p>Wouldn't it have been nice if all of these implementations were already done,\nbecause then you could have just solved <strong>The Original Goal</strong> directly by\nusing/applying previous code? That's what we're going to aim for.</p>\n<p>But first, you need to figure out which of the things you did rely on which\nothers, because you're going to break <strong>The Original Pull Request</strong> up until it\nexactly matches up to addressing <strong>The Original Goal</strong>.</p>\n<h2 id=\"4-the-break-up\">4. The Break Up</h2>\n<p>After you understand which parts of <strong>The Original Pull Request</strong> depend on each\nother, pick one independent part of the code that accomplishes one sub-goal.\nSince you're not doing this to be a martyr, and we all know git is too\ncomplicated to <em>Do It Right</em>, you're going to copy/paste the files that are\nrelated to this change to your desktop*.</p>\n<h2 id=\"5-escape-the-madness\">5. Escape the Madness</h2>\n<p>Before continuing, you're going to make sure all of the code in your big messy\nbranch for <strong>The Original Pull Request</strong> is committed and pushed. 
Even though we\nwant to supersede what's there, it never hurts to keep track of your descent\ninto madness.</p>\n<p>After there's nothing lying around, switch back to master. If your team has\ntaken good care of your repository, the master branch should be undisturbed by\nthe chaos you've created in <strong>The Original Pull Request</strong>. Make a new branch\nfrom master, and name it appropriately for fixing the one sub-goal, from here\nout known as <strong>The Sub-Goal</strong> that you identified in Step 4. Now you can start\nupdating the relevant files in your repository based on the files you copied to\nyour desktop. I suggest you don't copy/paste the contents of the whole files,\nbecause you might have forgotten about something else you changed in them. After\nall, you're reading my guide because this was a mess.</p>\n<h2 id=\"6-the-new-pull-request\">6. The New Pull Request</h2>\n<p>Once you've finished making the new branch for your independent part of code\nthat solves <strong>The Sub-Goal</strong>, you can make <strong>The New Pull Request</strong>.</p>\n<p>You will now go through the entire process of writing a good summary of this\nbranch for your co-developers, you will get their feedback, you will make\nupdates, pass flake8, and so on. They will thank you for having code that\naccomplishes one thing, and can be described simply. They will thank you for not\nhaving too big of a diff, and for the things in the diff all being relevant and\nimportant. Then you can merge this branch into master.</p>\n<h2 id=\"7-newfound-wisdom\">7. Newfound Wisdom</h2>\n<p>Throughout transferring the code for <strong>The New Pull Request</strong> you have probably\nrealized there are some things you did back in <strong>The Original Pull Request</strong>\nthat you could do better, and made some updates in the code in <strong>The New Pull\nRequest</strong> to reflect the wisdom you've gained along the way. 
That's great!\nCongratulations!</p>\n<p>After your team has approved <strong>The New Pull Request</strong>, you can merge it into\nmaster and delete the branch both locally and on the remote. Then you should\nswitch back to the master branch. You can pull from master, and see your code\nthat solved <strong>The Sub-Goal</strong> reflected here.</p>\n<h2 id=\"8-the-hard-part\">8. The Hard Part</h2>\n<p>This is the hard part. Now you have to switch back to the branch for <strong>The\nOriginal Pull Request</strong>. Now you have to update this branch from master. It's\ngoing to be hard because now you've probably made different changes in <strong>The New\nPull Request</strong> than in <strong>The Original Pull Request</strong>, so there will likely be\nconflicts.</p>\n<p>This is not a tutorial on how to solve merge conflicts. Use Google to figure\nthat out.</p>\n<p>I can't overstate this: <strong>do this part really well</strong>. If you don't, then the history\nin the original branch will be even more incomprehensible, and you won't be able\nto tell if you lost any of your original work. Please, please, please do this\nwell.</p>\n<p>P.S. Like I said before, don't be a martyr. Use tools like GitHub Desktop and\nPyCharm to help you merge. I heard that the git CLI was <em>allegedly</em> created by\nLinus Torvalds to slow other developers down.</p>\n<p>Why are we going through all of this pain, rather than just pushing your team to\nlet you merge <strong>The Original Pull Request</strong>? You have to do this\nbecause now all of the changes that addressed <strong>The Sub-Goal</strong> are part of\nmaster, and are no longer part of the diff of <strong>The Original Pull Request</strong>.</p>\n<p>Now you're one step closer to your team being able to understand, review, and\neventually merge <strong>The Original Pull Request</strong>.</p>\n<h2 id=\"9-the-frustrating-part\">9. The Frustrating Part</h2>\n<p>This is the frustrating part. 
After you've gone through all of that work to\nsplit a tiny part of <strong>The Original Pull Request</strong> into a smaller, independent\npull request, you're not done. You will probably have to repeat steps 4-8 a few\ntimes. You'll be tempted to throw away the branch for <strong>The Original Pull\nRequest</strong> and maybe start over.</p>\n<p>Don't do that.</p>\n<p>If you do, the same disorganization that led to the mess of <strong>The Original Pull\nRequest</strong> might just slip back into whatever you do next. Even worse, nobody\nelse will be able to follow what you've done until now.</p>\n<p>So relax. This is going to take a few days. You're going to have to wait in\nbetween several iterations for feedback. That's good. You need feedback. I need\nfeedback. We all need to practice getting it and giving it. Embrace the\nopportunity to have your team help you improve your code, gain wisdom, and make\nyour contributions sustainable.</p>\n<h2 id=\"finishing-up\">Finishing Up</h2>\n<p>Eventually after several iterations of 4-9, you will have excised all of the\ncode that was important for <strong>The Original Pull Request</strong>, but not directly\naccomplishing <strong>The Original Goal</strong>. As you removed independent parts, new parts\nbecame independent themselves. Eventually, <strong>The Original Pull Request</strong> will\nindeed match up exactly to <strong>The Original Goal</strong>, and then you will be able to come\nback to it for review and merging.</p>\n<p>I understand this is a frustrating process. The purpose of these steps was to\nhelp you think through a large piece of work you've done. You should be proud\nthat you've solved a complex problem with many intricate parts. 
It was a lot of\nextra work to break it into many pull requests, and it might have taken more of\nyour time the first time working through this process, but in the future, this\nmight help you to start with small tasks rather than addressing <strong>The Original\nGoal</strong> all at once. GitHub, for example, has an issue tracker that is very\nhelpful for this. I imagine that each issue should correspond to a <strong>Sub-Goal</strong>,\nand that each should have exactly one PR that addresses it. <strong>The Original\nGoal</strong> also deserves its own issue that points to all of the issues for its\nsub-goals. Eventually you will address this with a beautiful PR as well. Happy\ncoding!</p>\n<p>*If you're thinking, why don't I use cherry picking? If you know what cherry\npicking is in the context of git (and also how to use it) then you probably\nwon't have the issue that prompted this blog post. But also, you should go\noutside and pick some apples instead. Thanksgiving is never more than a few\nhundred days away. It pays to be ready.</p>\n<h2 id=\"afterword\">Afterword</h2>\n<p>It might be illustrative to see an example of where this was done in\npractice, so I'll share some work I did with a text mining tool from Harvard\nMedical School, <a href=\"https://github.com/indralab/gilda\">Gilda</a>. It's a simple yet\npowerful system for grounding of named entities based on dictionary lookup.\nUnfortunately, it didn't include some dictionaries I wanted, and it didn't have\na UI to go with its web API.</p>\n<p>So I set out on figuring out how it generated dictionaries, where it stored\nthem, and how it loaded them to make the web app. I ended up making several\nmodifications to accomplish this goal, but it was a huge PR. 
I've definitely\nannoyed the author, <a href=\"https://github.com/bgyori\">@bgyori</a>, with PRs that are too\nbig before, which he was ultimately not able to understand or merge.</p>\n<p>Keep in mind, in your team, your teammates might be obligated to help you\nbecause you're working towards a common goal, getting paid, etc. When you're in\nthe open source world, nobody really owes you anything, so it's in your best\ninterest to make things as easy as possible on the package's maintainer(s).</p>\n<p>So I made a few different pull requests that were all totally independent:</p>\n<ul>\n<li>Add constants for resource file paths\n<a href=\"https://github.com/indralab/gilda/pull/12\">#12</a></li>\n<li>Make API more reusable <a href=\"https://github.com/indralab/gilda/pull/13\">#13</a></li>\n<li>Make instantiation of Grounder more flexible\n<a href=\"https://github.com/indralab/gilda/pull/15\">#15</a></li>\n</ul>\n<p>Maybe you're seeing a theme here. I was improving lots of different bits of\nGilda so I could reuse the package in new code later. The next incremental\nincrease was:</p>\n<ul>\n<li>Refactor functionality from the GrounderInstance class into the Grounder class\n<a href=\"https://github.com/indralab/gilda/pull/16\">#16</a></li>\n</ul>\n<p>And finally with these in place, I realized that adding a web interface was\nparallel to my original goal, but not the core. What was really important was\nthat throughout all of the Gilda functionality, I could load my own synonym list\n(which I'd generate using the HPO, EFO, and DOID). I was able to address the UI\nwith:</p>\n<ul>\n<li>Add minimimal UI to web interface\n<a href=\"https://github.com/indralab/gilda/pull/19\">#19</a></li>\n</ul>\n<p>At the time of writing, we're still working through this PR. 
But all of it is\nleading up to the point where I can load my own files into this web interface.\nIt will seem so obvious to Ben when I send this PR next (but after giving him\nsome space\u2026 I did just bombard him with 5 PRs in a few days) what I am trying\nto accomplish and why.</p>\n<p>Want to see what happens when you try and do all of this in one PR? You will\ncorrectly guess that the PR is a total mess, impossible to understand, and\nriddled with questions that are really too big to answer when your head is\nalready so far in the sand. Behold, in all its infamy, my failed PR from last\nsummer (<a href=\"https://github.com/indralab/gilda/pull/4\">#4</a>). At this point, you\ncan't even see what a mess it was from the linked web page but if you go back\nthrough the version history before I broke it into 5 smaller PRs (using the\nworkflow described above) it was a monolith.</p>","doi":"https://doi.org/10.59350/3cn95-h9y94","guid":"https://cthoyt.com/2020/03/20/how-to-fix-your-monolithic-pull-request","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1584662400,"rid":"w8wqq-y9x71","summary":"We've all been there. You started a new branch from master. You had a very specific goal in mind, The Original Goal. You made a pull request (PR) to go with it, too, The Original Pull Request. But then, you had an idea! And also, someone on your team asked you to solve another problem!","tags":["Code With Me"],"title":"How to Fix Your Monolithic Pull Request","updated_at":1781539911,"url":"https://cthoyt.com/2020/03/20/how-to-fix-your-monolithic-pull-request.html","version":"v1"}},{"document":{"authors":[{"contributor_roles":[],"family":"Tapley Hoyt","given":"Charles","url":"https://orcid.org/0000-0003-4423-4370"}],"blog":{"authors":null,"community_id":"da4ef2af-5fad-46b2-8195-d77db0141ad6","created":1716422400,"current_feed_url":null,"description":"Unraveling complex biology with biological knowledge graphs. 
Content licensed under CC BY 4.0.","favicon":"https://rogue-scholar.org/api/communities/da4ef2af-5fad-46b2-8195-d77db0141ad6/logo","feed_format":"application/atom+xml","feed_url":"https://cthoyt.com/feed.xml","filter":null,"generator":"Jekyll","home_page_url":"https://cthoyt.com/","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"cthoyt","status":"active","subfield":"1312","title":"Biopragmatics","updated":1781539024,"use_api":null},"blog_name":"Biopragmatics","blog_slug":"cthoyt","content_html":"<p>A few months ago, the question was posed on science Twitter: \"How many people\nhave published on <a href=\"https://chemrxiv.org/\">ChemRxiv</a>?\"</p>\n<blockquote class=\"twitter-tweet\" data-partner=\"tweetdeck\"><p dir=\"ltr\" lang=\"en\">makes me wonder about the stats at <a href=\"https://twitter.com/ChemRxiv?ref_src=twsrc%5Etfw\">@ChemRxiv</a> <a href=\"https://t.co/Ml5X8F4ckJ\">https://t.co/Ml5X8F4ckJ</a></p>\u2014 Egon Willigh\u24d0gen (@egonwillighagen) <a href=\"https://twitter.com/egonwillighagen/status/1219193083792969728?ref_src=twsrc%5Etfw\">January 20, 2020</a></blockquote>\n<p>It was a good day for me, which meant I was in the mood to take up the first\nchallenge posed on Twitter. I found that Fran\u00e7ois-Xavier Coudert\n(<a href=\"https://github.com/fxcoudert\">@fxcoudert</a>) had previously written a\n<a href=\"https://github.com/fxcoudert/tools/blob/master/chemRxiv/chemRxiv.py\">Python client</a>\nfor ChemRxiv. I made a pair of pull requests\n(<a href=\"https://github.com/fxcoudert/tools/pull/9\">fxcoudert/tools#9</a> and\n<a href=\"https://github.com/fxcoudert/tools/pull/10\">fxcoudert/tools#10</a>) to fix some\nbugs and make it importable from other Python modules.</p>\n<p>Unlike BioRxiv, the pre-print server for biology, ChemRxiv is implemented with\n<a href=\"https://figshare.com/\">FigShare</a>. 
It turns out that all FigShare \"institutions\"\nlike ChemRxiv are actually accessible through the main\n<a href=\"https://docs.figshare.com/\">FigShare API</a>. I think this is pretty cool, and\nmade sure that the ChemRxiv client that I had updated was actually able to be\nrun for any institution. Fun fact: the institution code for ChemRxiv is <code class=\"language-plaintext highlighter-rouge\">259</code>.</p>\n<p>I got to work writing my\n<a href=\"https://github.com/cthoyt/chemrxiv-summarize\">own repository</a> to wrap the\nclient, take care of downloading all of the bibliographic information available,\nand generating some pretty pictures. I originally ran the scripts and generated\npictures on January 20th, 2020 (the day Egon posed the question). Since the\npandemic has got the whole science community introspecting, I came back to this\ntoday and thought it might be worth writing up as a blog post.</p>\n<p>Without further ado, here are the most recent charts I've generated to answer\nthree main questions. I've linked the images in such a way that the charts will\nbe automatically updated with my GitHub repository. This also implicitly means\nthat there's a history of each image, but because two of them are plotting time\ncourse information, the history is already conveyed within the chart.</p>\n<h3 id=\"how-many-articles-were-contributed-each-month-to-chemrxiv\">How many articles were contributed each month to ChemRxiv?</h3>\n<p>How many papers were submitted each month to ChemRxiv? 
Keep in mind that the\ncurrent month is likely not complete.</p>\n<p><img alt=\"Articles per Month\" src=\"https://raw.githubusercontent.com/cthoyt/chemrxiv-summarize/master/figshare/chemrxiv/articles_per_month.png\"/></p>\n<h3 id=\"how-many-unique-authors-contribute-each-month-to-chemrxiv\">How many unique authors contribute each month to ChemRxiv?</h3>\n<p>This only counts using the ORCID iDs of the first authors; it's pretty\ninconsistent what other identifying information is included in the metadata for\neach article.</p>\n<p><img alt=\"Unique Authors per Month\" src=\"https://raw.githubusercontent.com/cthoyt/chemrxiv-summarize/master/figshare/chemrxiv/unique_authors_per_month.png\"/></p>\n<h3 id=\"how-many-author-submit-multiple-times-each-month\">How many authors submit multiple times each month?</h3>\n<p>How many authors submitted more than once per month? This chart shows spikes in\nAugust, which I will guess is when most people are submitting before their\nsummer breaks :)</p>\n<p><img alt=\"Percent Duplicate Authors per Month\" src=\"https://raw.githubusercontent.com/cthoyt/chemrxiv-summarize/master/figshare/chemrxiv/percent_duplicate_authors_per_month.png\"/></p>\n<h3 id=\"how-many-authors-submitted-for-their-first-time-each-month\">How many authors submitted for their first time each month?</h3>\n<p><img alt=\"First Time First Authors per Month\" src=\"https://raw.githubusercontent.com/cthoyt/chemrxiv-summarize/master/figshare/chemrxiv/first_time_first_authors_per_month.png\"/></p>\n<h3 id=\"how-many-unique-first-authors-are-there-on-chemrxiv\">How many unique first authors are there on ChemRxiv?</h3>\n<p>How many first authors have historically contributed to ChemRxiv at each month?\nWe can take the first date of authorship for each author then count at each\nmonth how many unique first time authors there are. 
Then, we can use a\ncumulative sum to show how many authors have contributed to ChemRxiv at any\npoint in time.</p>\n<p><img alt=\"Historical Authorship\" src=\"https://raw.githubusercontent.com/cthoyt/chemrxiv-summarize/master/figshare/chemrxiv/historical_authorship.png\"/></p>\n<h3 id=\"how-many-authors-are-prolific-on-chemrxiv\">How many authors are prolific on ChemRxiv?</h3>\n<p>If we aggregate the data, we can ask how many authors have submitted lots of\narticles:</p>\n<p><img alt=\"Author Prolificness\" src=\"https://raw.githubusercontent.com/cthoyt/chemrxiv-summarize/master/figshare/chemrxiv/author_prolificness.png\"/></p>\n<h3 id=\"what-licenses-are-popular-on-chemrxiv\">What licenses are popular on ChemRxiv?</h3>\n<p>The following chart shows the popularity of different licenses over time. The\n<a href=\"https://creativecommons.org/licenses/by-nc-nd/4.0/\">CC BY-NC-ND 4.0 license</a> is\na resounding victor. You can learn about Creative Commons (CC) licenses\n<a href=\"https://creativecommons.org/licenses/\">here</a>.</p>\n<p><img alt=\"Historical Licenses\" src=\"https://raw.githubusercontent.com/cthoyt/chemrxiv-summarize/master/figshare/chemrxiv/historical_licenses.png\"/></p>\n<p>If you're interested to regenerate these charts yourself, you're welcome to do\nso with the following code:</p>\n<div class=\"language-bash highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>git clone https://github.com/cthoyt/chemrxiv-summarize\n<span class=\"nb\">cd </span>chemrxiv-summarize\npython 01_download.py\npython 02_process.py\npython 03_visualize.py\n</code></pre></div></div>\n<p>Downloading takes a bit of time (about 40 minutes) but there's a <code class=\"language-plaintext highlighter-rouge\">tqdm</code> bar to\nkeep you entertained in the mean time. 
Normally I package all of my code, but\nthe one off scripts here didn't seem to warrant it.</p>\n<p>As a final note, I'd like to shout out to Marshall Brennan\n(<a href=\"https://twitter.com/Organometallica\">@Organometallica</a>) for being an excellent\nspokesperson and public face of ChemRxiv. Also, throughout this process I\nrealized he also was a chemistry major in his bachelor's at Northeastern\nUniversity like me. Go huskies!</p>\n<hr/>\n<p>May 2020 Update: Fran\u00e7ois-Xavier Coudert created the\n<a href=\"https://chemrxiv-dashboard.github.io/\">ChemRxiv-Dashboard</a>, which makes some\nsimilar summaries to this. Check it out!</p>\n<blockquote class=\"twitter-tweet\" data-partner=\"tweetdeck\"><p dir=\"ltr\" lang=\"en\">I made a dashboard for <a href=\"https://twitter.com/ChemRxiv?ref_src=twsrc%5Etfw\">@ChemRxiv</a>, fed by the <a href=\"https://twitter.com/figshare?ref_src=twsrc%5Etfw\">@figshare</a><br/>metadata API.<a href=\"https://t.co/rKyAOGkrVO\">https://t.co/rKyAOGkrVO</a> <a href=\"https://t.co/fLfjEabraz\">pic.twitter.com/fLfjEabraz</a></p>\u2014 FX Coudert (@fxcoudert) <a href=\"https://twitter.com/fxcoudert/status/1262763710956793860?ref_src=twsrc%5Etfw\">May 19, 2020</a></blockquote>\n<p>November 2020 Update: I added a license chart and made some changes to enable\nthis repo to be much more easily used for other FigShare institutions. 
If you've\nfound this post from @figshare's\n<a href=\"https://twitter.com/figshare/status/1323762002293121025\">tweet</a> and want help\nmaking these charts for your FigShare institution, please feel free to @ me on\nTwitter or send me an email.</p>","doi":"https://doi.org/10.59350/n7rjr-90f02","guid":"https://cthoyt.com/2020/04/15/summarizing-chemrxiv","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1586908800,"rid":"3ky5b-cne42","summary":"A few months ago, the question was posed on science Twitter: \"How many people have published on ChemRxiv?\"","tags":["Bibliometrics"],"title":"Summarizing ChemRxiv","updated_at":1781539909,"url":"https://cthoyt.com/2020/04/15/summarizing-chemrxiv.html","version":"v1"}},{"document":{"authors":[{"contributor_roles":[],"family":"Tapley Hoyt","given":"Charles","url":"https://orcid.org/0000-0003-4423-4370"}],"blog":{"authors":null,"community_id":"da4ef2af-5fad-46b2-8195-d77db0141ad6","created":1716422400,"current_feed_url":null,"description":"Unraveling complex biology with biological knowledge graphs. Content licensed under CC BY 4.0.","favicon":"https://rogue-scholar.org/api/communities/da4ef2af-5fad-46b2-8195-d77db0141ad6/logo","feed_format":"application/atom+xml","feed_url":"https://cthoyt.com/feed.xml","filter":null,"generator":"Jekyll","home_page_url":"https://cthoyt.com/","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"cthoyt","status":"active","subfield":"1312","title":"Biopragmatics","updated":1781539024,"use_api":null},"blog_name":"Biopragmatics","blog_slug":"cthoyt","content_html":"<p>We have a big problem in the bioinformatics community with namespaces,\nidentifiers, and names. And nobody's posed the question better than\n<a href=\"https://www.youtube.com/watch?v=U0CGsw6h60k\">Rihanna herself</a>.</p>\n<p>During my Ph.D. 
at Fraunhofer, one of the old text miners reminisced to me about\nthe late 90's and early naughties when they had to curate their own dictionaries\nof synonyms for entities. I was lucky enough to have joined the bioinformatics\ncommunity after excellent nomenclature resources like\n<a href=\"https://www.ebi.ac.uk/chebi/\">ChEBI</a> and the <a href=\"https://www.genenames.org/\">HGNC</a>\nwere established and accepted by the community as gospel.</p>\n<p>I consider these sources excellent because it's quite easy to get a list of the\nidentifiers and corresponding names that they maintain (TSV, etc.). There are\nother nomenclatures, like the\n<a href=\"ftp://ftp.expasy.org/databases/enzyme/enzyme.dat\">ExPASy Enzyme Classes</a>, that\nare stored as text files in non-standard formats.</p>\n<p>The Open Biomedical Ontology (OBO) format and\n<a href=\"http://www.obofoundry.org/\">OBO Foundry</a> were first published in\n<a href=\"https://www.nature.com/articles/nbt1346\">2007</a> as a solution for standardizing\na growing set of biomedical ontologies that shared few semantics. Many ontology\nmaintainers adopted their format, or at least used the OWL to OBO converter\ntools to include their ontologies in a reusable format. However, there remain\nsome notable holdouts like the\n<a href=\"https://github.com/CLO-ontology\">Cell Line Ontology</a> that have not begun to\ndistribute their content as OBO.</p>\n<p>In parallel, the <a href=\"https://www.ebi.ac.uk/ols\">Ontology Lookup Service (OLS)</a> was\npublished as one of many front-ends for exploring this growing list of\nresources. 
In comparison, it may have been one of the first tools to provide a\nnice user experience that included a search engine (powered by\n<a href=\"http://www.obofoundry.org/\">solr</a>, because they're living in the Java world).</p>\n<p>Both are lacking - there does not exist a solid OBO ecosystem (though Martin\nLarralde's <a href=\"https://github.com/althonos/pronto\">pronto</a> may well soon change\nthat) and even worse, the content in OBO loosely follows the standard, at best.\nOn the other hand, the OLS has an over-engineered interface that isn't\nquite user friendly. For example, if you want to look up programmed cell death\n(GO:0012501), you have to know the internal OLS key for the namespace and the\nPURL for the identifier, which is not so obvious. Then you can finally hit the\n<a href=\"https://www.ebi.ac.uk/ols/api/ontologies/go/terms?iri=http://purl.obolibrary.org/obo/GO_0012501\">API</a>.</p>\n<p>And still, both of them lack some of my favorite, and arguably most important\nnamespaces, like HGNC, RGD, MGI, UniProt, Entrez Gene, and PubChem. As an aside,\ndealing with PubChem is for people operating on a whole different level, so I'm\nnot blaming anyone for dropping the ball on that one. Later, I will confess to\ndoing the same.</p>\n<p>Even worse, the OBO Foundry and OLS can't even agree on what to call some\nnamespaces. A great example is the NCBI taxonomy database. 
On the NCBI site,\nthey say that the namespace is called <code class=\"language-plaintext highlighter-rouge\">NCBI</code> and compact uniform identifiers\n(CURIEs) should look like <code class=\"language-plaintext highlighter-rouge\">NCBI:txid175694</code>, OBO Foundry says the namespace is\n<code class=\"language-plaintext highlighter-rouge\">NCBITaxon</code> (one of the few notable mixed-case namespace names) and CURIEs\nshould look like <code class=\"language-plaintext highlighter-rouge\">NCBITaxon:175694</code>.</p>\n<p>Identifiers.org came along to solve some of these ambiguities with a curated\ndatabase, but it's missing lots of the things in OBO Foundry and OLS, and it\neven disagrees on others. They call the NCBI taxonomy namespace <code class=\"language-plaintext highlighter-rouge\">taxonomy</code> and\nsay that identifiers should look like <code class=\"language-plaintext highlighter-rouge\">taxonomy:175694</code>. Exhausting!</p>\n<p><img alt=\"Registry Comparison\" src=\"https://cthoyt.com/img/registry_comparison.svg\"/></p>\n<p>One more issue is the GOGO problem. Many OBO ontologies use local identifiers\nthat also include the prefix because a given ontology might contain terms\nimported from other ones. However, this means that ontologies that originated\nfrom the OBO world have redundant identifiers, like from GO (e.g.,\nGO:GO:0012501). I know what you're wondering: is Dr. Claw in charge? Maybe.</p>\n<hr/>\n<p>The reason I went down this rabbit hole is because I want to support people to\ndo better curation. This means I want them to use identifiers instead of ever\nchanging names. 
For example, it turns out the half life of an HGNC gene symbol\nis very short -\n<a href=\"https://github.com/bio2bel/bio2bel-notebooks/blob/master/gene_symbol_half_life.ipynb\">thousands of them change every year</a>.\nHowever, if I want people to use identifiers instead of names in their\ndatabases, their papers, and other writing, there need to be really good tools\nfor looking up the names that go with each identifier and the cross-references\n(equivalences) to other databases that are talking about the same thing.</p>\n<p>So I built <a href=\"https://github.com/pyobo/pyobo\">PyOBO</a>. It includes tools for\nreading the OBO Foundry and getting all of the OBO resources that are available\n(as well as <em>many</em> manual fixes for incorrect metadata), it uses Daniel\nHimmelstein's <a href=\"https://github.com/dhimmel/obonet/\">Obonet</a> for parsing and\nstoring pre-parsed files for fast loading, and it applies a swath of rule-based\nnormalization that I've\n<a href=\"https://github.com/pyobo/pyobo/blob/master/src/pyobo/registries/metaregistry.json\">manually curated</a>\nby personally reading all of the OBO files, their identifiers, their\ncross-references, relationships, properties, and everything else. When it comes\nto data, there really is no way around getting your hands dirty.</p>\n<p>I also went ahead and\n<a href=\"https://github.com/pyobo/pyobo/tree/master/src/pyobo/sources\">wrote parsers and converters</a>\nfor lots of other databases like Entrez, ComplexPortal, InterPro, and others so\nthey could play nice with the rest of the ecosystem. Of course, this is an\nongoing process. There are always more databases to include, and when it comes\nto super-sized ones like PubChem, the paradigms I used might not hold up anymore\n(though I did write a parser/converter for it and you're welcome to use it).</p>\n<p>After this long journey of a blog post, I think we're ready to address Rihanna's\nperennial question: what's my name? 
Until now, there really didn't exist a\nservice that let you look up the name for an entity by its CURIE. The link I\ngave for the OLS is the closest I have found, and that just doesn't cut it.</p>\n<p>After all of this coding, I wrote a script (just run <code class=\"language-plaintext highlighter-rouge\">obo ooh-na-na</code>) that takes\nall of the available sources, normalizes their namespaces, normalizes their\nidentifiers, and dumps them as a big 'ol TSV file. 3 columns - namespace,\nidentifier, and name. No nonsense. Probably legal! Get it at\n<a href=\"https://doi.org/10.5281/zenodo.3756206\"><img alt=\"DOI\" src=\"https://zenodo.org/badge/DOI/10.5281/zenodo.3756206.svg\"/></a>.\nI'll make updates periodically as I add more sources, such as if/when I feel\ncomfortable with including the PubChem dump - the\n<a href=\"https://cthoyt.com/2020/04/18/ooh-na-na.html/ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Monthly/2020-04-01/Extras/CID-Title.gz\">CID-Title.gz</a>\nfile is about 1.3 gigabytes, which means this will significantly increase the\nsize, but not so much that it's unreasonable.</p>\n<p>I can imagine that most people probably won't want to download this file, or\nload it in memory (un-gzipped) every time they want to use it. 
I wrote a simple\nweb service that wraps this dataset\n<a href=\"https://github.com/pyobo/pyobo/blob/master/src/pyobo/apps/resolver.py\">included in PyOBO</a>.\nIt should be as easy as running with the shell with\n<code class=\"language-plaintext highlighter-rouge\">python -m pyobo.apps.resolver</code> then running the following python code:</p>\n<div class=\"language-python highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"kn\">import</span> <span class=\"nn\">requests</span>\n\n<span class=\"c1\"># This is an exact match\n</span><span class=\"n\">successful_request</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/resolve/DOID:14330'</span><span class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span class=\"c1\"># {\"identifier\": \"14330\", \"name\": \"Parkinson's disease\", \"prefix\": \"doid\", \"query\": \"DOID:14330\", \"success\": True}\n</span>\n<span class=\"c1\"># This one remaps the prefix if you get it slightly wrong\n</span><span class=\"n\">successful_remapped_request</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/resolve/DO:14330'</span><span class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span class=\"c1\"># {\"identifier\": \"14330\", \"name\": \"Parkinson's disease\", \"prefix\": \"doid\", \"query\": \"DO:14330\", \"success\": True}\n</span>\n<span class=\"c1\"># This one can't find the identifier.\n</span><span class=\"n\">unsuccessful_request</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/resolve/DO:00000'</span><span 
class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span class=\"c1\"># {\"identifier\": \"00000\", \"message\": \"Could not look up identifier\", \"prefix\": \"doid\", \"query\": \"DO:00000\", \"success\": False}\n</span>\n<span class=\"c1\"># Keep in mind, the point of this service isn't to validate identifiers.\n</span><span class=\"n\">unsuccessful_crazy_request</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/resolve/DO:thisIsNotRightAtAll'</span><span class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span class=\"c1\"># {\"identifier\": \"thisIsNotRightAtAll\", \"message\": \"Could not look up identifier\", \"prefix\": \"doid\", \"query\": \"DO:thisIsNotRightAtAll\", \"success\": False}\n</span>\n<span class=\"c1\"># No mercy for bad prefixes\n</span><span class=\"n\">unsuccessful_prefix_lookup</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/resolve/notanamespace:0000'</span><span class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span class=\"c1\"># {\"message\": \"Could not identify prefix\", \"query\": \"notanamespace:0000\", \"success\": False}\n</span></code></pre></div></div>\n<p>It's especially important that the service normalizes curies first, so both\n<code class=\"language-plaintext highlighter-rouge\">DOID:14330</code>, <code class=\"language-plaintext highlighter-rouge\">doid:14330</code>, and <code class=\"language-plaintext highlighter-rouge\">DO:14330</code> can all be resolved to their name,\n<em>Parkinson's disease</em>. 
Because I did extensive manual curation of namespaces and\ntheir synonyms, <code class=\"language-plaintext highlighter-rouge\">NCBITaxon</code> and <code class=\"language-plaintext highlighter-rouge\">taxonomy</code> are both acceptable as well. However,\nthis service doesn't load from the aforementioned TSV, but rather takes\nadvantage of PyOBO's internal code for looking up mappings. I can imagine lots\nof ways I might re-write this service to directly take advantage of this dump (I\nalso invite you to do the same, however best suits you) such as loading it into\nEdgeDB and auto-generating a GraphQL endpoint.</p>\n<p>The last thing that I'm looking into is getting this service hosted so everyone can\nbenefit from it without doing dev-ops in their own organizations. Then I will\ncontinue to obfuscate all usage and documentation with references to pop\nculture. Enjoy!</p>","doi":"https://doi.org/10.59350/wmj3y-04914","guid":"https://cthoyt.com/2020/04/18/ooh-na-na","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1587168000,"rid":"81da2-mtb29","summary":"We have a big problem in the bioinformatics community with namespaces, identifiers, and names. And nobody's posed the question better than Rihanna herself.","tags":["OBO","Lexica"],"title":"Ooh Na Na, What's My Name?","updated_at":1781539907,"url":"https://cthoyt.com/2020/04/18/ooh-na-na.html","version":"v1"}},{"document":{"authors":[{"contributor_roles":[],"family":"Tapley Hoyt","given":"Charles","url":"https://orcid.org/0000-0003-4423-4370"}],"blog":{"authors":null,"community_id":"da4ef2af-5fad-46b2-8195-d77db0141ad6","created":1716422400,"current_feed_url":null,"description":"Unraveling complex biology with biological knowledge graphs. 
Content licensed under CC BY 4.0.","favicon":"https://rogue-scholar.org/api/communities/da4ef2af-5fad-46b2-8195-d77db0141ad6/logo","feed_format":"application/atom+xml","feed_url":"https://cthoyt.com/feed.xml","filter":null,"generator":"Jekyll","home_page_url":"https://cthoyt.com/","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"cthoyt","status":"active","subfield":"1312","title":"Biopragmatics","updated":1781539024,"use_api":null},"blog_name":"Biopragmatics","blog_slug":"cthoyt","content_html":"<p>On top of the issue of <a href=\"https://cthoyt.com/2020/04/18/ooh-na-na.html\">resolving identifiers to their\nnames</a>, the bioinformatics community has a\nhard time figuring out when two identifiers from different databases are\nequivalent. You know who else has the same problem? Inspector Javert. Get ready\nfor a <em>Les Miserables</em>-themed post on how to address this long-standing problem.</p>\n<p>I have to start my tale of woes by disclosing my source material. I loved both\nthe 1985 and 1987 recordings from the respective original London and Broadway\ncasts. But, for the purposes of this post, I will assume that you've seen the\nexcellent 2012 film adaptation of Alain Boublil, Jean-Marc Natel, and Herbert\nKretzmer's musical adaptation of Victor Hugo's novel <em>Les Miserables</em> and tell\nthe story through that perspective. I also want you to know that I enjoyed\nRussell Crowe's Inspector Javert very much.</p>\n<p><em>Les Miserables</em> begins with the Work Song, in which the protagonist,\n<code class=\"language-plaintext highlighter-rouge\">prisoner:24601</code> is confronted by Inspector Javert while doing some\u2026 work. 
He\ninsists he has a name, Jean Valjean, and his identifier in his\n<a href=\"https://en.wikipedia.org/wiki/Faverolles,_Aisne\">home village</a>'s fictional\ndatabase (that I just retconned) was <code class=\"language-plaintext highlighter-rouge\">faverolles:2468</code>. Javert isn't interested\nin his name. It's enough that he has a cross-reference stating that <code class=\"language-plaintext highlighter-rouge\">faverolles:2468</code>\nis equivalent to <code class=\"language-plaintext highlighter-rouge\">prisoner:24601</code>. He was only there to inform Jean Valjean that\nhis parole has begun and issue him a <em>passeport jaune</em> (yellow ticket) for the\ncommune of <a href=\"https://en.wikipedia.org/wiki/Pontarlier\">Pontarlier</a>.</p>\n<p>I'm sure this passport also had an identifier on it. I'm going to take a bit of\ncreative freedom and say it was <code class=\"language-plaintext highlighter-rouge\">pontarlier:25791</code>. It probably also had Jean\nValjean's prisoner number on it so everybody knew he was in the 1800's fictional\nFrench convict database. The fictional 1800's French took maintaining\ncross-references very seriously.</p>\n<p>Jean Valjean never made it to Pontarlier. Instead, he broke his parole, forged\nsome new documents, and went to Montreuil-sur-Mer under the new name of Monsieur\nMadeleine. It's probably the case that his identifier for the Montreuil-sur-Mer\ncity database was <code class=\"language-plaintext highlighter-rouge\">montreuil-sur-mer:1357</code>, or something like this (more\nretcons!). It must have been a good fake, because even the king of France\nrecognized him (note: this plot point did not appear in the film).</p>\n<p>Javert figured out Jean Valjean broke his parole basically immediately and set\nout on his quest to find and capture <code class=\"language-plaintext highlighter-rouge\">prisoner:24601</code> once again. Until this\npoint, Javert has access to the prisoner registry and yellow tickets. 
He knows\n<code class=\"language-plaintext highlighter-rouge\">prisoner:24601</code> is the same as both <code class=\"language-plaintext highlighter-rouge\">faverolles:2468</code> and <code class=\"language-plaintext highlighter-rouge\">pontarlier:25791</code>.</p>\n<p>The part that will hit close to home for many bioinformaticians is that when\nJavert goes to Montreuil-sur-Mer, he meets Monsieur Madeleine. He is unaware\nthat it is Jean Valjean. There is no cross-reference between <code class=\"language-plaintext highlighter-rouge\">faverolles:2468</code>\nand <code class=\"language-plaintext highlighter-rouge\">pontarlier:25791</code> or <code class=\"language-plaintext highlighter-rouge\">montreuil-sur-mer:1357</code>. If there were a\ncross-reference in the fictional French 1800's inspector database, Javert could\nhave arrested Jean Valjean on sight. Instead, Javert had to do the hard work of\ncurating cross-references himself and finding out who was the same in the\n<code class=\"language-plaintext highlighter-rouge\">montreuil-sur-mer</code> database as <code class=\"language-plaintext highlighter-rouge\">prisoner:24601</code>. Admittedly, he probably would\nhave called this <em>inspecting</em>.</p>\n<p>The next part that will hit even closer to home for many bioinformaticians is\nthat after his inspecting, Javert actually identified the wrong guy! This led\nto one of my favorite songs in musical theater ever\n(<a href=\"https://www.youtube.com/watch?v=izuD30Cp5Ao\">Who Am I?</a>), where Monsieur\nMadeleine (<code class=\"language-plaintext highlighter-rouge\">montreuil-sur-mer:1357</code>, also actually Jean Valjean, but Javert\ndidn't yet realize this) admits that he is actually <code class=\"language-plaintext highlighter-rouge\">prisoner:24601</code>. In this\nextended metaphor of a blog post, Jean Valjean's confession in \"Who Am I\" is\neffectively the same as a database providing its own cross-references to other\ndatabases. 
Would be nice if everyone did this, and did it well, huh?</p>\n<p>You should know that Javert is a powerful cross-reference reasoning machine. He\nalready knew <code class=\"language-plaintext highlighter-rouge\">faverolles:2468</code> was the same as <code class=\"language-plaintext highlighter-rouge\">prisoner:24601</code>. Now he knew\nthat <code class=\"language-plaintext highlighter-rouge\">montreuil-sur-mer:1357</code> was the same as <code class=\"language-plaintext highlighter-rouge\">prisoner:24601</code>. This way, he\ncould infer that <code class=\"language-plaintext highlighter-rouge\">montreuil-sur-mer:1357</code> (Monsieur Madeleine) is actually\n<code class=\"language-plaintext highlighter-rouge\">faverolles:2468</code> (Jean Valjean). One of the nice properties of cross-references\nis that they're transitive through any number of connections. We'll take\nadvantage of this fact later. You'll also have to excuse the fact that\nthroughout this post, I'm operating under the assumption that \"cross-references\"\nand \"equivalences\" are the same thing. That's not always true, and sometimes it\ncan even get you in trouble. For example, provenance can be a cross-reference,\ndisease-gene associations are considered as cross-references in MONDO (I\nthink), and OBO even gives specific semantics for when you should consider this\nassumption valid. We'll just have to live with it for now.</p>\n<p>Javert might have got lucky that Jean Valjean revealed himself once, but the\nshow must go on! Jean Valjean had many more songs to sing and thus had to escape\nfrom Montreuil-sur-Mer to Paris. This meant that Javert has to find <em>another</em>\nmapping to Jean Valjean's new <code class=\"language-plaintext highlighter-rouge\">paris</code> identifier. 
And we already know that the\nFrench 1800's inspector database of cross-references was not being maintained.\nExhausting!</p>\n<hr/>\n<p>In the bioinformatics community, we have a very similar problem to Inspector\nJavert. There are lots of databases that are talking about the same things, but\nonly a few of them provide mappings between each other. This means that we\neither have to curate our own cross-references, do our best to infer new\ncross-references based on ones we already have, or throw our hands in the air.</p>\n<p>Luckily, we have a few standardized resources to fall back on. In addition to\nstandardizing the storage of identifier/name pairs, the OBO format standardizes\nthe way cross-references are stored and the OBO Foundry already contains quite a\nfew cross-references imported from the ontologies that it covers.</p>\n<p>One of the most difficult entity types to map from database to database is\nphenotypes because of the variety of language used to describe each, the\ndifferences in semantics of how each is defined, and the sheer number of\ndatabases. Unfortunately, some of the most popular ones like MeSH and to an extent,\nUMLS, NCIT, SNOMED-CT, and ICD (seemingly the culprits are mostly American!?)\nprovide very little accessible information. Some are even paid, so the only\ncross-references that exist are externally curated ones from other laudable\nsources like HP, DOID, and EFO. In fact, dealing with phenotypes is such a pain,\nthat there is a project called the\n<a href=\"https://monarchinitiative.org/\">Monarch Initiative</a> that has a huge staff\ntrying to solve exactly this problem and publish the results through the\n<a href=\"https://github.com/monarch-initiative/mondo\">Monarch Disease Ontology (MONDO)</a>.\nNormally, I would reference\n<a href=\"https://xkcd.com/927/\">this XKCD comic about making new standards</a> when hearing\nabout something like this. 
But these are dire times, and one of my opinions is\nthat you should always trust curators who love what they do.</p>\n<p>There are also lots of cross-references available from databases that don't\nmaintain their nomenclature as an ontology. One example is\n<a href=\"https://downloads.thebiogrid.org/File/BioGRID/Latest-Release/BIOGRID-IDENTIFIERS-LATEST.tab.zip\">BioGRID</a>,\nwhich assigns proteins internal accession numbers, but almost all of them\ncross-reference out to Entrez Gene (I counted fewer than 15 that didn't, and 3 of\nthem are COVID-related, so cut them some slack). As an aside, I don't really\nunderstand why BioGRID would go through the effort of maintaining their own\naccession numbers. In the literal handful of cases where they can't reference\nEntrez Gene, I think it would be better to email the maintainers and work with\nthem to make improvements.</p>\n<p>It's also worth noting that excellent resources like HGNC, MGI, RGD, SGD,\nEnsembl, UniProt, and others in the genome (and gene product) nomenclature do a\nstellar job at maintaining cross-references. So to all of the curators and\nmaintainers who work there, I would like to sincerely thank you.</p>\n<p>There are also community-curated cross-reference sources. One of the notable\nones is from Harvard Medical School, that's mapping MeSH identifiers to gene\nidentifiers in the\n<a href=\"https://raw.githubusercontent.com/indralab/gilda/master/gilda/resources/mesh_mappings.tsv\">Gilda GitHub repository</a>.\nI think this is really a good time to point out that MeSH contains a bit of\neverything, is ubiquitous throughout the bioinformatics community, and in my\nopinion is doing a huge disservice by not providing these kinds of mappings\nitself. Or, alternatively, it is, and both the Harvard guys and I have never\nfound it. It's not impossible, but we're all very motivated, so I think we would\nhave found it if it did. 
If any MeSH maintainers are reading this and want help\nmaking this happen, I would be elated to donate my time to you to help solve\nthis problem.</p>\n<p>With all of these data sources in mind, I built an extensible pipeline in\n<a href=\"https://github.com/pyobo/pyobo/blob/master/src/pyobo/xrefdb/xrefs_pipeline.py\">PyOBO</a>\nfor extracting cross-references from entries in OBO Foundry and other\ncross-reference sources. Throughout the process, I realized that these sources\nhave an incredible variety in how they name prefixes and how the OBO format\nitself has been (ab)used. I made lots of improvements, wrote extensible code\nthat allowed the specification of new rules through external files (and thus\nless code writing in the future), and did lots more curation. I won't get into\nthe technical part of that here, since you can read the code (if you dare).</p>\n<p>After all of this coding, I wrote a script (just run <code class=\"language-plaintext highlighter-rouge\">obo javerts-xrefs</code>) that\ntakes all available cross-references, normalizes their namespaces, normalizes\ntheir identifiers, and dumps them in a big 'ol TSV file. 5 columns - source\nnamespace, source identifier, target namespace, target identifier, and\nprovenance (ontology name or URL). No nonsense. Get it at\n<a href=\"https://doi.org/10.5281/zenodo.3757266\"><img alt=\"DOI\" src=\"https://zenodo.org/badge/DOI/10.5281/zenodo.3757266.svg\"/></a>.\nI'll make updates periodically as I add new sources.</p>\n<hr/>\n<p>Once you have a database of cross-references, you have actually built an\nundirected graph. Equivalences go both ways, and they are transitive. This means\nthat every connected component in an equivalence graph represents a set of\nentities that are mutually equivalent. 
In other words, if a path exists between\ntwo nodes in an equivalence graph, then they are equivalent.</p>\n<p>Even better, you don't have to materialize all of the possible inferred\nequivalences when you have an equivalence graph because identifying all of the\nnodes in a connected component can be done in linear time with respect to the\nsize of the connected component, which is usually pretty small, by using a\nbreadth-first or depth-first search.</p>\n<p>Based off of that, one application of an equivalence graph is to identify all of\nthe nodes that are equivalent to a given node. You can also get a little trickier\nand identify the paths through which the traversal must go if you want to\nestablish an equivalency. You could even go further and weight edges based on\nhow much you trust the source from which they came to identify how much you\nshould believe in a mapping. For example, if you have a percent confidence in\neach mapping being right, then the confidence in the whole pathway would be the\nproduct of the confidences.</p>\n<p>The actual problem I set out to solve was this: given a set of entities, remap all of\nthem based on a prioritized list. For example, I might have a set of entities\nthat contains HGNC genes, Entrez Genes, and OMIM genes. If my favorite\nnomenclature consortium is Entrez, my second favorite is HGNC, and my third\nfavorite is OMIM and I have an equivalence database, I might want to remap all\nof my identifiers. This is very important during the curation of mechanistic\nbiology (such as with BEL), since curators will likely use all sorts of\nidentifiers with no clear guidelines or rules. This means that the same entity\nmight appear twice with different identifiers in the same curated data!</p>\n<p>Given a priority list, you can even transform an equivalence graph into a\ndirected graph where each identifier has a single out edge pointing towards the\nidentifier that is the best mapping. 
Then, each connected component would become\na star graph. There's actually a better data structure for this, since each\nentity points to exactly one thing - a mapping. This is a more efficient data\nstructure for storage, and if your graph is implemented as an adjacency\ndictionary (because you're using <code class=\"language-plaintext highlighter-rouge\">networkx</code>, right?), then you basically already\nhave this.</p>\n<p>I've provided an implementation for all of these in PyOBO. They can be run as a\nweb API with <code class=\"language-plaintext highlighter-rouge\">python -m pyobo.apps.mapper</code>. There's a keyword argument to allow\nyou to load the TSV from Inspector Javert's Xref Database directly, or if you're\nfeeling lucky, to regenerate it yourself. Below I will give a few examples of\nhow to use it. Later, I would also like to host this service for anyone to use.</p>\n<ol>\n<li>Install PyOBO with <code class=\"language-plaintext highlighter-rouge\">pip install git+https://github.com/pyobo/pyobo.git</code></li>\n<li>Download Inspector Javert's Xref Database from Zenodo, unpack it, and find\nthe xrefs file.</li>\n<li>Run the web service with\n<code class=\"language-plaintext highlighter-rouge\">python -m pyobo.apps.mapper -x inspector_javerts_xrefs.tsv.gz</code></li>\n<li>Use the following code to figure stuff out!</li>\n</ol>\n<div class=\"language-python highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"kn\">import</span> <span class=\"nn\">requests</span>\n\n<span class=\"c1\"># Get all entities mapped to MAPT, including through chains of xrefs\n</span><span class=\"n\">successful_request</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/mappings/hgnc:6893'</span><span class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span 
class=\"s\">\"\"\"\n{\n    \"orphanet:123144\": [\n        {\n            \"provenance\": \"hgnc\",\n            \"source\": \"hgnc:6893\",\n            \"target\": \"orphanet:123144\"\n        }\n    ],\n    \"pr:P10636\": [\n        {\n            \"provenance\": \"hgnc\",\n            \"source\": \"hgnc:6893\",\n            \"target\": \"uniprot:P10636\"\n        },\n        {\n            \"provenance\": \"pr\",\n            \"source\": \"uniprot:P10636\",\n            \"target\": \"pr:P10636\"\n        }\n    ],\n    ...\n}\n\"\"\"</span>\n\n<span class=\"c1\"># Keep in mind this isn't a validation service\n</span><span class=\"n\">unsuccessful_request</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/mappings/hgnc:0000'</span><span class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span class=\"c1\"># {\"message\": \"could not find curie\", \"query\": {\"curie\": \"hgnc:0000\"}, \"success\": False}\n</span>\n<span class=\"c1\"># Get all paths mapping MAPT in HGNC to Ensembl. 
Returns a list of paths (which are lists of xrefs)\n</span><span class=\"n\">path_request</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/mappings/hgnc:6893/ensembl:ENSG00000186868'</span><span class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span class=\"s\">\"\"\"\n[\n    [\n        {\n            \"provenance\": \"hgnc\",\n            \"source\": \"hgnc:6893\",\n            \"target\": \"ensembl:ENSG00000186868\"\n        }\n    ]\n]\n\"\"\"</span>\n\n<span class=\"c1\"># Get the priority identifier for MAPT identified by Ensembl\n</span><span class=\"n\">prioritize_request</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/prioritize/cosmic:MAPT'</span><span class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span class=\"c1\"># {\"found\": True, \"query\": \"cosmic:MAPT\", \"result\": \"hgnc:6893\"}\n</span>\n<span class=\"c1\"># What happens when a CURIE can't be found for prioritization\n</span><span class=\"n\">unsuccessful_prioritize_request</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/prioritize/cosmic:NOPE'</span><span class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span class=\"c1\"># {\"found\": False, \"query\": \"cosmic:NOPE\"}\n</span></code></pre></div></div>\n<p>I'd like to give a big thanks to my high school music teacher, Ken Tedeschi, for\nhelping me (and basically everyone else) fall in love with Les Mis in high\nschool. Writing about my work was so much more fun in extended metaphor. I would\nalso like to thank Hugh Jackman. 
You know, for being Hugh Jackman.</p>\n<hr/>\n<p>I have some random afterthoughts that I think might be worth including, that I'm\nadding after originally posting this.</p>\n<p>You might be wondering why I didn't get into a discussion about the\n<a href=\"https://www.ebi.ac.uk/about/news/announcement/industry-collaboration-ontology-mapping-service\">Ontology Mapping Service (OXO)</a>\nfrom the EBI. It looks to me like this project has been abandoned. Even if not,\nits API has most of the same issues that I described in a <a href=\"https://cthoyt.com/2020/04/18/ooh-na-na.html\">previous\npost</a>.</p>\n<p>I'm also aware of <a href=\"https://bridgedb.github.io\">BridgeDB</a>, from which I think I\nwill be able to take inspiration to include more xrefs later. However, I think\nthey're limited in scope, and PyOBO is more about standardizing data so nobody\nhas to figure out databases\u2026 again and again and again.</p>\n<p>One glaring omission from this work is WikiData mappings. I have a plan to\ninclude curated information in the PyOBO metaregistry that links databases to\ntheir WikiData properties. That will allow me to build an automated framework\nfor downloading these mappings, given the curation of the properties.</p>","doi":"https://doi.org/10.59350/r3qzt-z0d08","guid":"https://cthoyt.com/2020/04/19/inspector-javerts-xref-database","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1587254400,"rid":"vqbr3-bcb03","summary":"On top of the issue of resolving identifiers to their names, the bioinformatics community has a hard time figuring out when two identifiers from different databases are equivalent. You know who else has the same problem? Inspector Javert. 
Get ready for a Les Miserables-themed post on how to address this long-standing problem.","tags":["Mappings"],"title":"Inspector Javert's Xref Database","updated_at":1781539905,"url":"https://cthoyt.com/2020/04/19/inspector-javerts-xref-database.html","version":"v1"}},{"document":{"authors":[{"contributor_roles":[],"family":"Tapley Hoyt","given":"Charles","url":"https://orcid.org/0000-0003-4423-4370"}],"blog":{"authors":null,"community_id":"da4ef2af-5fad-46b2-8195-d77db0141ad6","created":1716422400,"current_feed_url":null,"description":"Unraveling complex biology with biological knowledge graphs. Content licensed under CC BY 4.0.","favicon":"https://rogue-scholar.org/api/communities/da4ef2af-5fad-46b2-8195-d77db0141ad6/logo","feed_format":"application/atom+xml","feed_url":"https://cthoyt.com/feed.xml","filter":null,"generator":"Jekyll","home_page_url":"https://cthoyt.com/","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"cthoyt","status":"active","subfield":"1312","title":"Biopragmatics","updated":1781539024,"use_api":null},"blog_name":"Biopragmatics","blog_slug":"cthoyt","content_html":"<p>As scientists, we place huge importance on the communication of our results. We\nspend lots of time on editing, revising, and formatting so people can understand\nwhat we did. We also write a lot of code, so why aren't we investing the same\namount of love? Enter, <a href=\"https://flake8.pycqa.org/en/latest/\">flake8</a>.</p>\n<p>It's incredibly important that we write following community standards so when\nother people read our work, they don't have to think about how it's organized.\nFor scientific prose, this usually means the IMRD\n(introduction-methods-results-discussion) format. 
In Python, my current favorite\nprogramming language for science, this means using a standardized number of\nspaces for indents (4), using triple-double quotes for docstrings at the\nbeginning of each module, class, and function, and lots more.</p>\n<p>It's pretty intimidating to figure out style. For English prose, Strunk and\nWhite wrote\n<a href=\"http://www.jlakes.org/ch/web/The-elements-of-style.pdf\"><em>The Elements of Style</em></a>.\nFor Python, Guido van Rossum wrote\n<a href=\"https://www.python.org/dev/peps/pep-0008/\">PEP-8</a> and Raymond Hettinger\npresented <a href=\"https://www.youtube.com/watch?v=wf-BqAjZb8M\">Beyond PEP-8</a>. Even with\nthese resources, it's still hard to learn which are rules and which\n<a href=\"https://www.youtube.com/watch?v=k9ojK9Q_ARE\">are more like guidelines</a>.</p>\n<p>This post is a short explanation of how I use <code class=\"language-plaintext highlighter-rouge\">flake8</code> to keep a consistent\nstyle in the code in my Python projects. There's a similar command line tool for\nfixing the style in R projects that's already built into most operating\nsystems - <code class=\"language-plaintext highlighter-rouge\">rm -rf *</code>, but I won't get more into that here.</p>\n<p>It's pretty easy to get up and running with <code class=\"language-plaintext highlighter-rouge\">flake8</code> - just run\n<code class=\"language-plaintext highlighter-rouge\">pip install flake8</code> then use it from the shell on a Python file like\n<code class=\"language-plaintext highlighter-rouge\">flake8 my_file.py</code> or <code class=\"language-plaintext highlighter-rouge\">flake8 my_directory/</code>. Then, it outputs a list of\nproblems that need to be fixed on a line-by-line basis in your code.</p>\n<p><img alt=\"Flake8 Feedback\" src=\"https://cthoyt.com/img/flake8_output.png\"/></p>\n<p>You can also install plugins with <code class=\"language-plaintext highlighter-rouge\">pip</code> that extend the kinds of things it\nchecks. 
A few that I install are:</p>\n<ul>\n<li><a href=\"https://github.com/gforcada/flake8-builtins\">flake8-builtins</a> - make sure you\ndon't accidentally name a variable the same thing as a builtin. This happens a\nlot with <code class=\"language-plaintext highlighter-rouge\">id</code>.</li>\n<li><a href=\"https://github.com/PyCQA/flake8-bugbear\">flake8-bugbear</a> - \"find likely bugs\nand design problems in your program\", like when you have an unused variable in\na loop</li>\n<li><a href=\"https://github.com/and3rson/flake8-colors\">flake8-colors</a> - add color to the\n<code class=\"language-plaintext highlighter-rouge\">flake8</code> output (explanation how to set up is below)</li>\n<li><a href=\"https://github.com/PyCQA/flake8-commas\">flake8-commas</a> - add trailing commas\nwhere appropriate</li>\n<li><a href=\"https://github.com/adamchainz/flake8-comprehensions\">flake8-comprehensions</a>\nreminders to use list comprehensions where appropriate</li>\n<li><a href=\"https://github.com/PyCQA/flake8-docstrings\">flake8-docstrings</a> - make sure\nyour docstrings are present and written in the right format</li>\n<li><a href=\"https://github.com/PyCQA/flake8-import-order\">flake8-import-order</a> - make\nsure your imports are organized properly</li>\n<li><a href=\"https://github.com/JBKahn/flake8-print\">flake8-print</a> - make sure you never\never ever use <code class=\"language-plaintext highlighter-rouge\">print()</code>. 
The literal only exception is when using print to get\ntext into a file with <code class=\"language-plaintext highlighter-rouge\">print(..., file=...)</code></li>\n<li><a href=\"https://github.com/MichaelKim0407/flake8-use-fstring\">flake8-use-fstring</a> -\nmake sure you're using f-strings instead of <code class=\"language-plaintext highlighter-rouge\">%</code> or <code class=\"language-plaintext highlighter-rouge\">.format()</code> formatting.\nException being for logging.</li>\n<li><a href=\"https://github.com/PyCQA/pep8-naming\">pep8-naming</a> - make sure names of\nvariables, classes, and modules look right.</li>\n<li><a href=\"https://github.com/PyCQA/pydocstyle/\">pydocstyle</a> - docstring style checker</li>\n</ul>\n<p>In each of my repositories, I put all of the information on how to install\n<code class=\"language-plaintext highlighter-rouge\">flake8</code> and its plugins then run them in a <code class=\"language-plaintext highlighter-rouge\">tox</code> configuration under the\n<code class=\"language-plaintext highlighter-rouge\">[testenv:flake8]</code> header so they can easily reproducibly run with\n<code class=\"language-plaintext highlighter-rouge\">tox -e flake8</code>. 
An example of part of one of my <code class=\"language-plaintext highlighter-rouge\">tox.ini</code> files (which always\nlives in the root of the repository) is below:</p>\n<div class=\"language-ini highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"nn\">[testenv:flake8]</span>\n<span class=\"py\">skip_install</span> <span class=\"p\">=</span> <span class=\"s\">true</span>\n<span class=\"py\">deps</span> <span class=\"p\">=</span>\n    <span class=\"err\">flake8</span>\n    <span class=\"err\">flake8-bandit</span>\n    <span class=\"err\">flake8-builtins</span>\n    <span class=\"err\">flake8-bugbear</span>\n    <span class=\"err\">flake8-colors</span>\n    <span class=\"err\">flake8-commas</span>\n    <span class=\"err\">flake8-comprehensions</span>\n    <span class=\"err\">flake8-docstrings</span>\n    <span class=\"err\">flake8-import-order</span>\n    <span class=\"err\">flake8-print</span>\n    <span class=\"err\">flake8-use-fstring</span>\n    <span class=\"err\">pep8-naming</span>\n    <span class=\"err\">pydocstyle</span>\n<span class=\"py\">commands</span> <span class=\"p\">=</span>\n    <span class=\"err\">flake8</span> <span class=\"err\">src/pybel/</span> <span class=\"err\">tests/</span> <span class=\"err\">setup.py</span>\n<span class=\"py\">description</span> <span class=\"p\">=</span> <span class=\"s\">Run the flake8 tool with several plugins (bandit, docstrings, import order, pep8 naming).</span>\n</code></pre></div></div>\n<p>Another configuration file you can set up in the root of the repository is\n<code class=\"language-plaintext highlighter-rouge\">.flake8</code>. 
Unfortunately, the Python configuration file reader doesn't allow\nsome of the crazy characters that I want to use for the colors so this can't be\nincluded in <code class=\"language-plaintext highlighter-rouge\">setup.cfg</code> or <code class=\"language-plaintext highlighter-rouge\">tox.ini</code> like most of your other configuration.</p>\n<div class=\"language-ini highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"nn\">[flake8]</span>\n<span class=\"py\">ignore</span> <span class=\"p\">=</span>\n    <span class=\"c\"># line break before binary operator\n</span>    <span class=\"err\">W503</span>\n<span class=\"py\">exclude</span> <span class=\"p\">=</span>\n    <span class=\"err\">.tox,</span>\n    <span class=\"err\">.git,</span>\n    <span class=\"err\">__pycache__,</span>\n    <span class=\"err\">docs/source/conf.py,</span>\n    <span class=\"err\">build,</span>\n    <span class=\"err\">dist,</span>\n    <span class=\"err\">tests/fixtures/*,</span>\n    <span class=\"err\">*.pyc,</span>\n    <span class=\"err\">*.egg-info,</span>\n    <span class=\"err\">.cache,</span>\n    <span class=\"err\">.eggs</span>\n<span class=\"py\">max-line-length</span> <span class=\"p\">=</span> <span class=\"s\">120</span>\n<span class=\"py\">import-order-style</span> <span class=\"p\">=</span> <span class=\"s\">pycharm</span>\n<span class=\"py\">application-import-names</span> <span class=\"p\">=</span>\n    <span class=\"err\">pybel</span>\n    <span class=\"err\">bel_resources</span>\n    <span class=\"err\">tests</span>\n<span class=\"py\">format</span> <span class=\"p\">=</span> <span class=\"s\">${cyan}%(path)s${reset}:${yellow_bold}%(row)d${reset}:${green_bold}%(col)d${reset}: ${red_bold}%(code)s${reset} %(text)s</span>\n</code></pre></div></div>\n<p>First thing you'll notice is the <code class=\"language-plaintext highlighter-rouge\">ignore</code> list. 
This isn't here to turn <code class=\"language-plaintext highlighter-rouge\">flake8</code>\noff because you're feeling lazy. If somebody includes a change in this list in\ntheir PR, you have to explain to them that compliance is not optional, then help\nthem work through the problem that they obviously gave up on solving. It's\nactually there for you, as the project maintainer, to enumerate the <code class=\"language-plaintext highlighter-rouge\">flake8</code>\nrules that you don't agree with. For example, I totally disagree with the <code class=\"language-plaintext highlighter-rouge\">W503</code>\nline break before binary operator rule. I want to write long conditionals with each\n<code class=\"language-plaintext highlighter-rouge\">and</code> operator at the start of its line, like this:</p>\n<div class=\"language-python highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"k\">if</span> <span class=\"p\">(</span>\n    <span class=\"n\">condition_1</span>\n    <span class=\"ow\">and</span> <span class=\"n\">condition_2</span>\n    <span class=\"ow\">and</span> <span class=\"n\">condition_3</span>\n<span class=\"p\">):</span>\n    <span class=\"k\">print</span><span class=\"p\">(</span><span class=\"s\">'all true'</span><span class=\"p\">)</span>\n</code></pre></div></div>\n<p>One of the benefits of this style is you can add more lines with only single\nline diffs. The other is that the reader always sees the operation that goes\nwith each line. Same could be done with arithmetic that could incorporate not\nonly <code class=\"language-plaintext highlighter-rouge\">+</code> but also <code class=\"language-plaintext highlighter-rouge\">-</code>.</p>\n<p>Next is the <code class=\"language-plaintext highlighter-rouge\">exclude</code> block. Just copy/paste this each time, since it has lots\nof garbage you don't want <code class=\"language-plaintext highlighter-rouge\">flake8</code> to bother with. 
One of the checkers in\n<code class=\"language-plaintext highlighter-rouge\">flake8</code> is for function \"cyclomatic\" complexity. You can make the maximum\nnumber higher with <code class=\"language-plaintext highlighter-rouge\">max-complexity</code>. Normally, you want this to be enforced, but\nsometimes there's no way around a complex function. For this, you can add a code\ncomment <code class=\"language-plaintext highlighter-rouge\">noqa</code> followed by the error code like <code class=\"language-plaintext highlighter-rouge\"># noqa:W123</code>. Again, adding tags\nto ignore bad style just to pass <code class=\"language-plaintext highlighter-rouge\">flake8</code> is against the point.</p>\n<p>The <code class=\"language-plaintext highlighter-rouge\">max-line-length</code> is a very contentious setting. I think 120 is fine. Some\npeople think 78, 79, or 80 is best because of the standard sizes of old computer\nscreens or punch cards\u2026 When I get older and I can't read my computer screen,\nI'll probably make the text bigger and change my mind about this. If you find\nyourself breaking up lines in a totally non-sensical, unstyled way, then you're\nconforming too tightly to the rules. Sorry about the mixed messages!</p>\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>import-order-style = pycharm\napplication-import-names =\n    pybel\n    bel_resources\n    tests\n</code></pre></div></div>\n<p>I copied this again because this part is really important. You have to tell\n<code class=\"language-plaintext highlighter-rouge\">flake8</code> what rules you use for import order. I use the pycharm rules, which\ngroup python builtin packages, then 3rd party packages, then my packages. 
The\n<code class=\"language-plaintext highlighter-rouge\">application-import-names</code> is a place to list what are your packages.</p>\n<p>Last is the <code class=\"language-plaintext highlighter-rouge\">format</code> entry, which gives the nice colorful output. Copy paste\nthis! I borrowed mine from <a href=\"https://github.com/scolby33\">Scott Colby</a>.</p>\n<hr/>\n<p>After all of that, I set up Travis CI to run <code class=\"language-plaintext highlighter-rouge\">tox</code> every time code is pushed to\nthe repository. If you're working in a team, you probably do something like the\nfork/pull request or branch/pull request workflow on GitHub to support doing\ncode review before merging new code. The best part is that there's a big box on\neach pull request that checks if <code class=\"language-plaintext highlighter-rouge\">flake8</code> passed (among other tests), which\nmeans that there were no errors detected.</p>\n<p>I encourage my teammates to make pull requests as soon as they start working on\ncode. GitHub even has a \"draft pull request\" mode now. However, before asking\nanyone to review your code, it has to pass <code class=\"language-plaintext highlighter-rouge\">flake8</code>. And obviously, no code that\nisn't passing flake8 can be merged.</p>\n<p>This is a <em>very</em> painful process to get people used to. I've done it with many\ngroups of people and always got pushback. However, everyone who has gone through\nthe process with me has come out the other side happy that they did it. It's\nimportant that when you start enforcing coding rules on other people that you\nare a resource for them - when somebody is frustrated by a flake8 error code\nthey have never seen, they will likely forget how to use Google. They will\nprobably ask you for help. You have to resist the urge to send\n<a href=\"https://lmgtfy.com\">lmgtfy</a> links to them and be patient. 
Because eventually,\nthey will do it on their own, and spread the gospel of <code class=\"language-plaintext highlighter-rouge\">flake8</code>.</p>\n<p>While a good arsenal of <code class=\"language-plaintext highlighter-rouge\">flake8</code> plugins provides a solid foundation, it's not\nall that needs to be done to make your code readable and look good. Just like\nwith reading and speaking, the best way to develop a sense of style is by\nreading <em>lots</em> of code (with the caveat that reading poorly written code\nprobably won't teach you much). Within the rules imposed by <code class=\"language-plaintext highlighter-rouge\">flake8</code>, there is\nlots of space for style. If you watch lectures from David Beazley, you'll notice\na very different style from Raymond Hettinger, and also from me.</p>\n<p>Now that you've made it to the end of this short guide, I wish you the best of\nluck on developing your own style!</p>\n<hr/>\n<p>Are you working with people who are particularly unsusceptible to Travis CI\nemails or checking the big red box on pull requests? You could try getting them\nset up with <a href=\"https://pre-commit.com/\">pre-commit hooks</a>, which run the style\nchecks locally any time they try and push (even if it's to a branch) and it will\ngive them the message in the console.</p>\n<p>Is style not your thing at all / you're not ready to let go of your identity as\na Java/Perl developer? Maybe consider <a href=\"https://github.com/psf/black\">Black</a>,\nwhich actually re-writes your code in a deterministic style. 
I don't live by it,\nbut it's a great tool to run on a code base that's never been loved before going\nback and stylizing it.</p>","doi":"https://doi.org/10.59350/cfxtw-0ma23","guid":"https://cthoyt.com/2020/04/25/how-to-code-with-me-flake8","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1587772800,"rid":"4y55t-c1k96","summary":"As scientists, we place huge importance on the communication of our results. We spend lots of time on editing, revising, and formatting so people can understand what we did. We also write a lot of code, so why aren't we investing the same amount of love? Enter, flake8.","tags":["Code With Me"],"title":"How to Code with Me - Flake8 Hell","updated_at":1781539903,"url":"https://cthoyt.com/2020/04/25/how-to-code-with-me-flake8.html","version":"v1"}}],"items":[{"authors":[{"affiliation":[{"id":"https://ror.org/050qmg959","name":"Singapore Management University"}],"contributor_roles":[],"family":"Tay","given":"Aaron","url":"https://orcid.org/0000-0003-0159-013X"}],"blog":{"authors":null,"community_id":"f34e2211-9904-4b58-97ab-0beeb79ef6f7","created":1697068800,"current_feed_url":null,"description":"Aaron Tay's thoughts about academic librarianship","favicon":"https://rogue-scholar.org/api/communities/f34e2211-9904-4b58-97ab-0beeb79ef6f7/logo","feed_format":"application/rss+xml","feed_url":"https://aarontay.substack.com/feed","filter":null,"generator":"Substack","home_page_url":"https://aarontay.substack.com","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"musings","status":"active","subfield":"3309","title":"Aaron Tay's Musings about Librarianship","updated":1781540136,"use_api":true},"blog_name":"Aaron Tay's Musings about Librarianship","blog_slug":"musings","content_html":"<p><em>This post is part of a \"hot takes\" series in which I make sharper claims than I usually do. 
I do not intend to offend, and I am not trying to tar every librarian with the same brush \u2014 the patterns I describe and perceive may be a function of my own local context. </em></p><p><em>In my last hot takes post, <a href=\"https://aarontay.substack.com/p/hot-take-stop-calling-poor-search\">I argued that while designing a search system to maximise learning gains may not always align with designing a search system that scores the best for relevancy, unplanned friction in learning aka poor relevancy is never a good idea.</a></em></p><p><em>In this post, I consider the idea of tools that are considered flawed because they can't match the performance of an older tool in some way<a class=\"footnote-anchor\" data-component-name=\"FootnoteAnchorToDOM\" id=\"footnote-anchor-1\" href=\"#footnote-1\" target=\"_self\">1</a>.</em></p><p class=\"button-wrapper\" data-attrs=\"{&quot;url&quot;:&quot;https://ko-fi.com/aarontay&quot;,&quot;text&quot;:&quot;Buy me coffee via ko-fi!&quot;,&quot;action&quot;:null,&quot;class&quot;:null}\" data-component-name=\"ButtonCreateButton\"><a class=\"button primary\" href=\"https://ko-fi.com/aarontay\"><span>Buy me coffee via ko-fi!</span></a></p><div class=\"captioned-image-container\"><figure><a class=\"image-link image2 is-viewable-img\" target=\"_blank\" href=\"https://substackcdn.com/image/fetch/$s_!ciEX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png\" data-component-name=\"Image2ToDOM\"><div class=\"image2-inset\"><picture><source type=\"image/webp\" srcset=\"https://substackcdn.com/image/fetch/$s_!ciEX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png 424w, 
https://substackcdn.com/image/fetch/$s_!ciEX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png 848w, https://substackcdn.com/image/fetch/$s_!ciEX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png 1272w, https://substackcdn.com/image/fetch/$s_!ciEX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png 1456w\" sizes=\"100vw\"><img src=\"https://substackcdn.com/image/fetch/$s_!ciEX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png\" width=\"1134\" height=\"651\" data-attrs=\"{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:651,&quot;width&quot;:1134,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}\" class=\"sizing-normal\" alt=\"\" srcset=\"https://substackcdn.com/image/fetch/$s_!ciEX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png 424w, 
https://substackcdn.com/image/fetch/$s_!ciEX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png 848w, https://substackcdn.com/image/fetch/$s_!ciEX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png 1272w, https://substackcdn.com/image/fetch/$s_!ciEX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png 1456w\" sizes=\"100vw\" fetchpriority=\"high\"></picture><div class=\"image-link-expand\"><div class=\"pencraft pc-display-flex pc-gap-8 pc-reset\"><button tabindex=\"0\" type=\"button\" class=\"pencraft pc-reset pencraft icon-container restack-image\"><svg role=\"img\" width=\"20\" height=\"20\" viewBox=\"0 0 20 20\" fill=\"none\" stroke-width=\"1.5\" stroke=\"var(--color-fg-primary)\" stroke-linecap=\"round\" stroke-linejoin=\"round\" xmlns=\"http://www.w3.org/2000/svg\"><g><title></title><path d=\"M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882\"></path></g></svg></button><button tabindex=\"0\" type=\"button\" class=\"pencraft pc-reset pencraft icon-container view-image\"><svg xmlns=\"http://www.w3.org/2000/svg\" width=\"20\" height=\"20\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" class=\"lucide lucide-maximize2 lucide-maximize-2\"><polyline points=\"15 3 21 3 21 9\"></polyline><polyline points=\"9 21 
3 21 3 15\"></polyline><line x1=\"21\" x2=\"14\" y1=\"3\" y2=\"10\"></line><line x1=\"3\" x2=\"10\" y1=\"21\" y2=\"14\"></line></svg></button></div></div></div></a></figure></div><p>Imagine you are a librarian in 2004, <a href=\"https://googleblog.blogspot.com/2004/10/scholarly-pursuits.html\">when Google Scholar launches in beta</a>.<br>You have read the studies. <a href=\"https://www.emerald.com/oir/article/29/2/208/315378/Google-Scholar-the-pros-and-the-cons\">The coverage gaps are real</a>. <a href=\"https://vlex.co.uk/vid/metadata-mega-mess-in-846721887\">The metadata is wretched</a>. Nobody, including Google, can tell you exactly what is indexed<a class=\"footnote-anchor\" data-component-name=\"FootnoteAnchorToDOM\" id=\"footnote-anchor-2\" href=\"#footnote-2\" target=\"_self\">2</a>. </p><p>You decide it is the dumbest thing on the planet and declare that it could never be useful to anyone, despite a vocal minority of users insisting otherwise.</p><p>Fast forward to 2024. Google Scholar has become the default academic search starting point for many researchers and is widely regarded as the most comprehensive free academic search engine<a class=\"footnote-anchor\" data-component-name=\"FootnoteAnchorToDOM\" id=\"footnote-anchor-3\" href=\"#footnote-3\" target=\"_self\">3</a>. Your first assumption might be that Google fixed all the problems.</p><p>But that assumption would be mostly wrong.</p><p>Sure, Google Scholar improved, especially in coverage. Its metadata may arguably have improved as well. But you still cannot consult a stable, auditable list of indexed sources or records. Even Google's own guidance effectively tells users to test coverage by sampling titles rather than by checking a definitive inventory. One of the fundamental weaknesses librarians diagnosed in 2004 is still there.</p><p>And yet the librarians and researchers of 2024 are not idiots.</p><p>What happened is simple. Users learnt to compensate. 
They used Scholar for what it was good at and routed around its flaws or used another tool for other use cases (e.g. evidence synthesis). They learnt when its metadata could not be trusted, when its coverage was opaque, and when it needed to be complemented with another database. The tool did not become perfect. It became useful enough, alongside other tools.</p><div class=\"captioned-image-container\"><figure><a class=\"image-link image2 is-viewable-img\" target=\"_blank\" href=\"https://substackcdn.com/image/fetch/$s_!_kqs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png\" data-component-name=\"Image2ToDOM\"><div class=\"image2-inset\"><picture><source type=\"image/webp\" srcset=\"https://substackcdn.com/image/fetch/$s_!_kqs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png 424w, https://substackcdn.com/image/fetch/$s_!_kqs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png 848w, https://substackcdn.com/image/fetch/$s_!_kqs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png 1272w, https://substackcdn.com/image/fetch/$s_!_kqs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png 1456w\" sizes=\"100vw\"><img src=\"https://substackcdn.com/image/fetch/$s_!_kqs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png\" width=\"999\" height=\"632\" 
data-attrs=\"{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:632,&quot;width&quot;:999,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:814374,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aarontay.substack.com/i/200014373?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}\" class=\"sizing-normal\" alt=\"\" srcset=\"https://substackcdn.com/image/fetch/$s_!_kqs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png 424w, https://substackcdn.com/image/fetch/$s_!_kqs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png 848w, https://substackcdn.com/image/fetch/$s_!_kqs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png 1272w, https://substackcdn.com/image/fetch/$s_!_kqs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50eeb872-ee6d-44db-b0e5-9f73776df674_999x632.png 1456w\" sizes=\"100vw\" loading=\"lazy\"></picture>
</div></a></figure></div><p></p><p>That trajectory is surprisingly common with new technology, and it is one I think about when I read arguments that say AI, or any technological tool, can never be useful because of some \"fundamental flaw\".</p><h2>The concession has already been made on usefulness</h2><div class=\"captioned-image-container\"><figure><a class=\"image-link image2 is-viewable-img\" target=\"_blank\" href=\"https://substackcdn.com/image/fetch/$s_!z3Ha!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png\" data-component-name=\"Image2ToDOM\"><div class=\"image2-inset\"><picture><source type=\"image/webp\" 
srcset=\"https://substackcdn.com/image/fetch/$s_!z3Ha!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png 424w, https://substackcdn.com/image/fetch/$s_!z3Ha!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png 848w, https://substackcdn.com/image/fetch/$s_!z3Ha!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png 1272w, https://substackcdn.com/image/fetch/$s_!z3Ha!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png 1456w\" sizes=\"100vw\"><img src=\"https://substackcdn.com/image/fetch/$s_!z3Ha!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png\" width=\"1456\" height=\"740\" data-attrs=\"{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:740,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1543664,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aarontay.substack.com/i/200014373?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}\" 
class=\"sizing-normal\" alt=\"\" srcset=\"https://substackcdn.com/image/fetch/$s_!z3Ha!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png 424w, https://substackcdn.com/image/fetch/$s_!z3Ha!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png 848w, https://substackcdn.com/image/fetch/$s_!z3Ha!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png 1272w, https://substackcdn.com/image/fetch/$s_!z3Ha!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4127aecf-6c4d-4cd2-80e4-29958b76726f_1758x894.png 1456w\" sizes=\"100vw\" loading=\"lazy\"></picture>
</div></a></figure></div><p>Some of the most prominent sceptics have already said as much.</p><p>Gary Marcus, cognitive scientist, author of Rebooting AI, and one of the most consistent public critics of AI hype, has repeatedly acknowledged that LLMs can be useful, especially for coding, brainstorming and writing, while arguing that they are unreliable and not a route to AGI alone.</p><p>Margaret Mitchell, a co-author of the influential \"<a href=\"https://dl.acm.org/doi/10.1145/3442188.3445922\">Stochastic Parrots</a>\" paper, has been even more explicit: <a href=\"https://medium.com/@margarmitchell/no-ai-is-not-a-stochastic-parrot-a99e57766bed\">LLMs can be \"extremely useful\"</a>.</p><p><a href=\"https://medium.com/@emilymenonbender/stochastic-parrots-frequently-unasked-questions-49c2e7d22d11\">Emily M. Bender has likewise clarified that \"stochastic parrot\" is a description or metaphor, not an argument that these systems have no possible utility</a><a class=\"footnote-anchor\" data-component-name=\"FootnoteAnchorToDOM\" id=\"footnote-anchor-4\" href=\"#footnote-4\" target=\"_self\">4</a><a href=\"https://medium.com/@emilymenonbender/stochastic-parrots-frequently-unasked-questions-49c2e7d22d11\">.</a></p><p>Mike Caulfield, of SIFT fame, is actively studying and using these tools. These are not AI boosters. If even they concede that the tools can be useful, the basic question has already been settled.</p><p>You can oppose LLM use on environmental grounds, labour grounds, epistemic grounds, or any number of other defensible grounds. 
</p><p>But the claim that these tools can never be useful to anyone has moved past argument into something closer to an unfalsifiable position. No demonstration of utility, no improvement in the tools, and no evidence of successful use by researchers seems able to update it. </p><h2>A note on scope</h2><p>For the rest of this post, when I say \"AI tools\", I mean AI-powered academic search tools. I work in this space, so I will stay in my lane. The argument may extend elsewhere, but I am not pretending to make that case here.</p><p>It is also worth being precise about what \"AI search\" means. <a href=\"https://aarontay.substack.com/p/what-do-we-actually-mean-by-ai-powered\">The term covers several different things: changing what gets retrieved, reranking results, summarising content, and generating direct answers to questions</a>. These are not the same capability and do not carry the same risks.</p><p>A librarian who objects to generative answer synthesis is making a different argument from one who objects to AI-assisted reranking. Conflating the two muddles the debate. Before objecting to \"AI search\", it is worth saying which part concerns you, and why.</p><p>The Google Scholar analogy maps most cleanly onto AI-assisted retrieval and reranking: helping surface relevant results that users might otherwise miss. It also maps reasonably well onto \"tip-of-the-tongue\" search, one of the limited uses Bender has acknowledged as potentially useful.</p><p>It maps less directly onto generative answer synthesis, where hallucination risks are sharper. <a href=\"https://aarontay.substack.com/p/what-do-we-actually-mean-by-ai-powered\">I am not arguing that all uses of AI in search carry equal risk</a>. I am arguing that even the riskiest versions clear the \"never useful\" bar. 
The appropriate response to different risk profiles is <a href=\"https://aarontay.substack.com/p/why-use-of-new-ai-enhanced-tools-that\">differentiated teaching, not blanket rejection.</a></p><h2>Back to the analogy</h2><p>The people who insisted in 2004 that Google Scholar could never be useful until Google published full holdings lists were sure they were right. But they were eventually proven wrong in concluding that a tool could never be useful without that.</p><p><em>A tool need not be perfect to be useful</em>. This sounds obvious, but it is the point that keeps getting lost.</p><blockquote><p>One objection is that the Scholar analogy fails because LLM errors are different. Google Scholar had messy metadata and opaque coverage. LLMs produce overconfident hallucinations.</p><p>That objection has force, but notice what it actually supports. It supports teaching verification skills. It supports scaffolding. It supports appropriate scepticism. It does not support the early conclusion that the tool is useless.</p><p>A tool that requires careful handling is not the same as a tool that cannot be useful. The verification argument also cuts both ways. Uncritical acceptance of search results, including Google Scholar results, has always been the failure mode librarians teach against.</p><p>Perhaps the scaffolding still will not be enough. But I am sure many librarians who rejected Google Scholar in 2004 were pretty sure too.</p></blockquote><p>Nor does it help that some of the most vocal sceptics seem not to have engaged seriously with these tools since 2023. They appear to underestimate how much the systems, and the harnesses around them, have changed in just three years. But even that point is secondary. The Scholar parallel does not depend on the exact pace of improvement<a class=\"footnote-anchor\" data-component-name=\"FootnoteAnchorToDOM\" id=\"footnote-anchor-5\" href=\"#footnote-5\" target=\"_self\">5</a>.
</p><blockquote><p>I currently believe, from some experience setting up and testing agents, that between the use of code to constrain the model and aggressive multi-validation checks, you can reduce the probability of errors and hallucinations to very low levels, comparable to those of a human.</p></blockquote><p>The lesson is not that tools improve. The lesson is that what counts as good enough is not always obvious at first.</p><h2>Three objections</h2><p>Some librarians argue that the Google Scholar comparison to AI breaks down on three grounds: librarians do not actively promote Scholar (but do actively promote AI); users genuinely want Scholar rather than having it pushed on them; and Scholar is free, whereas many AI tools are commercial products.</p><p>None of these objections does the work required.</p><p>On promotion, many librarians do promote Google Scholar. It appears in LibGuides, instruction sessions and one-on-one consultations. The claim that \"we do not promote it\" is a polite fiction. Beyond what is said publicly, plenty of academic librarians reach for Google, Scholar or Wikipedia first in their own work when the situation calls for it.</p><p>On user demand, users clearly want AI tools. Whatever one thinks of that demand, pretending it does not exist is not a serious position.</p><p>On cost, being free or paid is a separate question from whether a tool can be useful. There are genuine concerns about commercial AI: vendor lock-in, inequity of access, environmental cost, labour implications, surveillance, and the commercial capture of scholarly infrastructure. Those concerns deserve serious engagement.</p><p>But they are arguments about adoption, governance and institutional support. They are not arguments that the tools can never be useful.</p><p>And let us be honest about the comparison. Library databases have plenty of flaws: idiosyncratic interfaces, uneven indexing, opaque relevance ranking, and sometimes weak metadata. 
We still pay substantial sums for them and promote them as a matter of course. The objection to AI tools cannot simply be that they are commercial and imperfect, because by that standard half the collection budget becomes difficult to defend.</p><h2>The \"abusing trust\" argument</h2><p>There is a related claim that deserves a direct response: librarians who teach users how to use AI search tools are abusing professional trust because the tools are imperfect and can lead to errors.</p><p>This is a bad argument dressed up as an ethical one. It rests entirely on the premise that AI search tools are imperfect, as though that distinguished them from anything else we teach.</p><p>Name one flawless tool that we librarians promote.</p><p><em>Our job, properly understood, is to teach people to use imperfect tools well,</em> with appropriate scepticism and a clear understanding of what each tool can and cannot do. Refusing to teach AI search tools because they are flawed is not an act of professional integrity. It is an abdication of the actual job.</p><p>It also leaves users to figure these tools out on their own. That is the worse outcome by every measure.</p><h2>The badly understood \"Stochastic Parrots\" argument</h2><p>A common argument among librarians goes something like this: Emily Bender says LLMs are stochastic parrots; therefore, LLMs can never be useful.</p><p>There is an immediate problem with that argument. Even if you accept the \"stochastic parrot\" description, it does not tell you whether LLMs combined with other technologies can be useful. 
It says nothing directly about retrieval-augmented generation, tool use, calculators, citation validators, structured workflows, human review, or other harnesses wrapped around the model.</p><p>The more damaging problem is that <a href=\"https://medium.com/@emilymenonbender/stochastic-parrots-frequently-unasked-questions-49c2e7d22d11\">Bender herself has clarified that \"stochastic parrot\" is not an argument that LLMs are useless</a>. In her account, it is not even an empirical hypothesis. It is a description or metaphor for systems that generate fluent linguistic form without grounding in communicative intent, a model of the world, or a model of the reader's state of mind<a class=\"footnote-anchor\" data-component-name=\"FootnoteAnchorToDOM\" id=\"footnote-anchor-6\" href=\"#footnote-6\" target=\"_self\">6</a>. </p><p>This does not mean Bender thinks LLMs are broadly useful. <a href=\"https://www.fastmail.com/digitalcitizen/exploring-ai-with-emily-m-bender/\">Her position is far more sceptical than that</a>. She has warned that synthetic text is not an information source, and that <a href=\"https://buttondown.com/maiht3k/archive/information-literacy-and-chatbots-as-search/\">using chatbots as reliable sources of knowledge is a serious category mistake.</a></p><p>But <a href=\"https://www.fastmail.com/digitalcitizen/exploring-ai-with-emily-m-bender/\">she has acknowledged limited possible uses, including \"tip-of-the-tongue\" search, language-learning dialogue partners, non-player characters in games, and non-generative uses of language models in classification, speech recognition and machine translation</a>. She treats summarisation more cautiously, because it can introduce material not present in the source. </p><p>Nor does this mean Bender has retreated from the stronger \"form versus meaning\" argument. 
In <a href=\"https://aclanthology.org/2020.acl-main.463/\">Bender and Koller's 2020 paper</a>, understanding is defined as mapping language to something outside language. Their claim is that a system trained only on linguistic form has no basis for learning that mapping, because it has access only to patterns in text, not to the extra-linguistic world those texts are about.</p><p>That is a serious argument. But it should not be flattened into the much weaker claim that LLMs can never be useful.</p><p>So the better conclusion is not \"stochastic parrots can never be useful\" (though she is currently very sceptical). It is: <strong>do not mistake fluent synthetic text for grounded understanding or reliable information</strong>. That is a much narrower and stronger warning, and a useful one, but it does not address the question of usefulness.</p><p>It does, however, leave room to ask the question that actually matters for librarians: under what conditions, with what scaffolding, for which tasks, and with what verification, can LLM-based systems be made useful rather than misleading?</p><p></p><h2>The lesson</h2><p>The lesson from Google Scholar is not that librarians should embrace every flawed tool users like. It is that \"flawed\" and \"useless\" are not synonyms.</p><p>It is hard to compare like for like, but I think it is fair to say that the practical gains in LLM-powered tools from 2023 to 2026 have been faster and larger than Google Scholar's improvements across its first decade. But the more important point is not the scale of improvement. It is that Google Scholar's improvement did not fundamentally fix its transparency problem. Instead, users learnt that the flaw was either less fatal than it first appeared, or manageable with the right habits.</p><p>That is the lesson librarians need to take seriously now.</p><p>If the objection is environmental cost, make the environmental argument. If it is labour exploitation, make the labour argument. 
If it is vendor lock-in, inequity, surveillance, weak governance, or commercial capture, make those arguments. They are serious enough to stand on their own.</p><p>They do not need a backdoor return to the claim that AI tools cannot really be useful.</p><p>That move is increasingly unconvincing. It usually begins with a reluctant concession: of course AI can sometimes be useful. Then, when the discussion turns to teaching, adoption or institutional support, the old premise quietly reappears. The tools are flawed, so using them must be irresponsible.</p><p>But that is not how librarians treat tools.</p><p>We teach flawed systems all the time. We teach Google Scholar while warning about coverage and metadata. We teach Scopus and Web of Science while explaining their selectivity. We teach discovery layers while knowing their indexing and ranking are imperfect.</p><p>The professional act is not pretending tools are flawless. It is teaching people where they help, where they fail, and how to verify what matters.</p><p>So reject an AI tool because the cost is too high, the governance is too weak, the evidence is too thin, or the institutional incentives are wrong.</p><p>Just say that.</p><p>Do not dress those objections up as proof that the tool can never be useful. That argument has already lost.</p><p>The question now is not whether AI search tools can be useful. 
It is which uses are worth the cost, which are not, and what role librarians should play in helping users tell the difference.</p><div class=\"footnote\" data-component-name=\"FootnoteToDOM\"><a id=\"footnote-1\" href=\"#footnote-anchor-1\" class=\"footnote-number\" contenteditable=\"false\" target=\"_self\">1</a><div class=\"footnote-content\"><p>There are many similarities to the Innovator's Dilemma argument. Early users may value dimensions that experts discount, and a tool that performs poorly against established professional criteria may still become useful enough to reshape practice. 
But unlike in the Innovator's Dilemma, I am referring to cases where the alternatives co-exist.</p></div></div><div class=\"footnote\" data-component-name=\"FootnoteToDOM\"><a id=\"footnote-2\" href=\"#footnote-anchor-2\" class=\"footnote-number\" contenteditable=\"false\" target=\"_self\">2</a><div class=\"footnote-content\"><p>Google Scholar offered no stable, auditable list of indexed sources or records, and even <a href=\"https://scholar.google.com/scholar/help.html#coverage\">Google's own guidance effectively tells users to test coverage by sampling titles rather than consult a definitive inventory.</a></p></div></div><div class=\"footnote\" data-component-name=\"FootnoteToDOM\"><a id=\"footnote-3\" href=\"#footnote-anchor-3\" class=\"footnote-number\" contenteditable=\"false\" target=\"_self\">3</a><div class=\"footnote-content\"><p>As confirmed by many studies. The amount of full-text indexed by Google Scholar is also believed by many to be unmatched.</p></div></div><div class=\"footnote\" data-component-name=\"FootnoteToDOM\"><a id=\"footnote-4\" href=\"#footnote-anchor-4\" class=\"footnote-number\" contenteditable=\"false\" target=\"_self\">4</a><div class=\"footnote-content\"><p>To be clear, Bender is not being cited here as making the same claim as Marcus or Mitchell that LLMs are broadly useful. <a href=\"https://www.fastmail.com/digitalcitizen/exploring-ai-with-emily-m-bender/\">Her position is much narrower and more sceptical.</a> In <a href=\"https://www.fastmail.com/digitalcitizen/exploring-ai-with-emily-m-bender/\">interviews</a>, she has said that safe and beneficial uses of synthetic text are hard to identify, but has offered tentative examples, <em>especially \"tip of the tongue\" search</em>, where a user describes something in order to recover the name of it and can then verify it through ordinary search. 
She also distinguishes text generation from other uses of language models, saying that language models can have positive uses in classification, automatic speech recognition, and machine translation, while treating summarisation as more borderline because it can introduce material not present in the source. The point here is therefore not that Bender endorses LLMs as broadly useful, but that \"stochastic parrot\", in her own account, is not an empirical hypothesis or an argument that LLMs have no possible utility. It is a description or metaphor for language-mimicking systems, and the 2021 paper was about the risks and harms of pursuing ever-larger language models, not a general paper about \"AI\".</p></div></div><div class=\"footnote\" data-component-name=\"FootnoteToDOM\"><a id=\"footnote-5\" href=\"#footnote-anchor-5\" class=\"footnote-number\" contenteditable=\"false\" target=\"_self\">5</a><div class=\"footnote-content\"><p>The improvement of LLMs between 2022 and 2026 is far larger than Google Scholar's from 2004 to 2024! It comes both from improvements in the models and from the use of harnesses like Claude Code to combine deterministic code with LLM flexibility.</p></div></div><div class=\"footnote\" data-component-name=\"FootnoteToDOM\"><a id=\"footnote-6\" href=\"#footnote-anchor-6\" class=\"footnote-number\" contenteditable=\"false\" target=\"_self\">6</a><div class=\"footnote-content\"><p>To be clear, this does not mean Bender has retreated from the stronger \"form versus meaning\" argument. In her account, the closest thing to an argument in this area is <a href=\"https://aclanthology.org/2020.acl-main.463/\">Bender and Koller's 2020 paper</a>, which defines understanding as mapping language to something outside language. Their claim is that a system trained only on linguistic form has no basis for learning that mapping, because it only has access to patterns in text, not to the extra-linguistic world those texts are about. 
This is separate from the \"stochastic parrots\" phrase itself, which Bender describes as a metaphor rather than an empirical hypothesis. She also notes that multimodal systems complicate the picture: image-text models may meet the Bender and Koller definition of understanding in a very thin sense, because they can map between linguistic strings and images. But she argues that the stochastic-parrot framing remains relevant to such systems and to systems built around them.</p><p></p></div></div>","doi":"https://doi.org/10.59350/xjc74-4s752","guid":"200014373","image":"https://substackcdn.com/image/fetch/$s_!ciEX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0530b4d-0b78-4409-adee-31a9dd7e9389_1134x651.png","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1781481600,"rid":"a0bcx-c8490","summary":"What 2004 can teach us about 2024 \u2014 and the librarians who keep getting the lesson wrong","tags":["Llm","Ai Search"],"title":"Learning from Google Scholar and why a tool does not need to be flawless to be useful","updated_at":1781542041,"url":"https://aarontay.substack.com/p/learning-from-google-scholar-and","version":"v1"},{"authors":[{"contributor_roles":[],"family":"Dingemanse","given":"Mark"}],"blog":{"authors":null,"community_id":"ac7a6214-f166-416e-9500-caa8343d6285","created":1780876800,"current_feed_url":null,"description":"Sounding out ideas on language, interaction, and 
iconicity","favicon":"https://rogue-scholar.org/api/communities/ac7a6214-f166-416e-9500-caa8343d6285/logo","feed_format":"application/atom+xml","feed_url":"https://ideophone.org/feed/atom/","filter":"category:98","generator":"WordPress","home_page_url":"https://ideophone.org","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"ideophone","status":"active","subfield":"1203","title":"The Ideophone","updated":1781539498,"use_api":true},"blog_name":"The Ideophone","blog_slug":"ideophone","content_html":"<p>Note to readers: some of these ideas made it into a commentary I wrote with Christine Cuskley:</p>\n\n\n\n<p class=\"has-background\" style=\"background-color:#bcd9e670\">Dingemanse, Mark &amp; Cuskley, Christine (in press). For robust research, center values, not technology. <em>Behavioral and Brain Sciences</em>. Preprint doi: <a href=\"https://doi.org/10.5281/zenodo.18944023\" target=\"_blank\" rel=\"noreferrer noopener\">10.5281/zenodo.18944023</a></p>\n\n\n\n<p>One topic that often comes up when discussing <a href=\"https://ideophone.org/generative-ai-and-research-integrity/\" data-type=\"post\" data-id=\"8271\">LLM technology in relation to research integrity</a> is one that I will describe as <em>seeking permission</em>. When looking at the ethical, legal, and societal harms imposed by LLMs (<a href=\"https://hcommons.org/?get_group_doc=1005140/1757881623-Guest_etal_2025.pdf\">and there are many</a>), sometimes people feel the message ends up altogether too negative. How about this use case I heard of? Aren&#8217;t some people getting something useful out of it? Surely we can&#8217;t ban the tech outright?</p>\n\n\n\n<p>Often this is phrased as a concern about messaging (people won&#8217;t accept it if we tell them how bad it is; you need to sugarcoat it by also mentioning something nice). 
Sometimes it is phrased as a majority argument: it&#8217;s already here, everyone is using it, surely it can&#8217;t be that bad? (Smoking would like a word.) Sometimes it is a concern about missing the boat: these are skills we need, telling people not to use it is like telling them to go back to quill pens. <sup><a href=\"#footnote_1_8781\" id=\"identifier_1_8781\" class=\"footnote-link footnote-identifier-link\" title=\"Quick rhetorical intervention here: I don&rsquo;t hide my distaste of Big Tech LLMs, but I rarely suggest to ban them, so when I get this quip I typically bounce it back: what made you think I suggested that? There&rsquo;s a good conversation to be had here, but it is not about quill pens. It is about the moral distress that a response like this reveals.\">1</a></sup></p>\n\n\n\n<p>What I think is going on in these kinds of cases is that the central thrust of the argument is being missed. <strong>When it comes to research integrity, the key is a values-first perspective rather than a tech-first perspective.</strong></p>\n\n\n\n<p>A values-first perspective asks: how can we best uphold the values and standards that make our research robust, reproducible and future proof? A tech-first perspective asks: yeah, but how can I use this technology? It puts technology above values. It seeks permission but sidesteps the question of values.</p>\n\n\n\n<h2 class=\"heading\" class=\"wp-block-heading\">Can I use an LLM for &#8230;</h2>\n\n\n\n<p>One example that came up in a <a href=\"https://ideophone.org/on-generative-ai-and-reproducibility/\" data-type=\"post\" data-id=\"8740\">recent session</a> started with <em>literature review</em>: surely it&#8217;s not too harmful, a questioner said, to use an LLM for a first stab at a literature review? Multiple participants pushed back on this, saying that actually, LLMs don&#8217;t reliably summarise. Also, they provide only the most average consensus view; we don&#8217;t know what&#8217;s being left out. 
And LLMs by nature regurgitate without understanding; can we actually identify confidently produced bullshit in a field that we don&#8217;t master fully? Further, reading is a hard-won skill: tracing arguments, spending time with papers, separating substance and rhetorics; surely we don&#8217;t want to lose this skillset.</p>\n\n\n\n<p>A retreat followed: well then, maybe literature review was not the best example, but surely there is some other kind of use that is okay? Someone mentioned programming. Folks brought up deskilling, the security risks of vibe-coding, technological debt. Oh well, that&#8217;s not the kind of programming I meant. And so on.</p>\n\n\n\n<p>We can play this game all day: <a href=\"https://anthonymoser.github.io/writing/ai/haterdom/2025/08/26/i-am-an-ai-hater.html\">seek permission</a> for an edge case, retreat to another case when trouble arises, appeal to amiability or tolerance. Two things to note when a conversation gets to this point. First, it&#8217;s a misrepresentation: reluctance to grant permission is portrayed as an intolerant move, a wish to take the other&#8217;s toys, when it is nothing like that. What do you need my permission for, anyway? Second, it&#8217;s a distraction: it moves the conversation away from values and towards a tech-first perspective.</p>\n\n\n\n<h2 class=\"heading\" class=\"wp-block-heading\">Values: a matrix for mindful choice</h2>\n\n\n\n<p>This is where a values-first perspective can help cut through the knots. The approach is not to recommend or forbid particular tech products, or particular uses; it is to provide a matrix for mindful choice. For any attested or conceivable use, you can ask: how would this help or hinder my upholding of high standards of research integrity? 
If we consider the pushback voiced in response to the &#8216;literature review&#8217; use case, you can see they appeal to the same core values:</p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>honesty</em> (reporting processes accurately and being open about margins of uncertainty)</li>\n\n\n\n<li><em>scrupulousness</em> (being precise and thoughtful)</li>\n\n\n\n<li><em>transparency </em>(showing one's process and allowing others to build on it)</li>\n\n\n\n<li><em>independence</em> (being impartial and unswayed by commercial or political interests) </li>\n\n\n\n<li><em>responsibility</em> (taking into account the environment; accepting accountability for the statements made)</li>\n</ul>\n\n\n\n<p>These values do not come out of thin air; they&#8217;re straight from a widely adopted code of conduct for research integrity (NCCRI, 2018). They are also not controversial; most scientists will recognise them as principles that characterize robust research. I&#8217;ve written about them before, e.g. in my <a href=\"https://ideophone.org/generative-ai-and-research-integrity/\" data-type=\"post\" data-id=\"8271\">guidance on GenAI</a> and in my post on why <a href=\"https://ideophone.org/why-synthetic-text-is-incompatible-with-science-blogging/\" data-type=\"post\" data-id=\"8516\">synthetic text has no place in science blogging</a>.</p>\n\n\n\n<p>If you use these values as a compass to steer by, it&#8217;s easier to navigate the landscape of technology use. On the other hand, if you find yourself seeking permission, one useful thing to do is to step back and inspect the underlying value conflict. </p>\n\n\n\n<p>As you move from a tech-first to a values-first perspective, the question shifts from &#8220;won&#8217;t you give me permission?&#8221; to &#8220;how do I do the best science possible?&#8221;. 
And that, to me, is a question worth asking.</p>\n\n\n\n<h2 class=\"heading\" class=\"wp-block-heading\">Further reading</h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https://anthonymoser.github.io/writing/ai/haterdom/2025/08/26/i-am-an-ai-hater.html\">I am an AI Hater</a>, by Anthony Moser</li>\n\n\n\n<li><a href=\"https://zenodo.org/records/17065099\">Against the Uncritical Adoption of &#8216;AI&#8217; technologies in academia</a>, by Olivia Guest and collaborators</li>\n\n\n\n<li><a href=\"https://www.nwo.nl/en/netherlands-code-of-conduct-for-research-integrity\">Netherlands Code of Conduct for Research Integrity, 2018</a></li>\n</ul>\n\n\n\n<p></p>\n<ol class=\"footnotes\"><li id=\"footnote_1_8781\" class=\"footnote\">Quick rhetorical intervention here: I don&#8217;t hide my distaste of Big Tech LLMs, but I rarely suggest to ban them, so when I get this quip I typically bounce it back: what made you think I suggested that? There&#8217;s a good conversation to be had here, but it is not about quill pens. It is about the moral distress that a response like this reveals.<span class=\"footnote-back-link-wrapper\">[<a href=\"#identifier_1_8781\" class=\"footnote-link footnote-back-link\">&#8617;</a>]</span></li></ol>","doi":"https://doi.org/10.59350/cwfrp-k7f11","guid":"https://ideophone.org/?p=8781","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1765584000,"rid":"1tyst-nr357","summary":"Note to readers: some of these ideas made it into a commentary I wrote with Christine Cuskley: Dingemanse, Mark &amp; Cuskley, Christine (in press). 
For robust research, center values, not technology.","tags":["Academia","Most Read","Writing","Generative AI"],"title":"Don't seek permission, center values","updated_at":1781541999,"url":"https://ideophone.org/dont-seek-permission-center-values/","version":"v1"},{"authors":[{"contributor_roles":[],"family":"Tapley Hoyt","given":"Charles","url":"https://orcid.org/0000-0003-4423-4370"}],"blog":{"authors":null,"community_id":"da4ef2af-5fad-46b2-8195-d77db0141ad6","created":1716422400,"current_feed_url":null,"description":"Unraveling complex biology with biological knowledge graphs. Content licensed under CC BY 4.0.","favicon":"https://rogue-scholar.org/api/communities/da4ef2af-5fad-46b2-8195-d77db0141ad6/logo","feed_format":"application/atom+xml","feed_url":"https://cthoyt.com/feed.xml","filter":null,"generator":"Jekyll","home_page_url":"https://cthoyt.com/","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"cthoyt","status":"active","subfield":"1312","title":"Biopragmatics","updated":1781539024,"use_api":null},"blog_name":"Biopragmatics","blog_slug":"cthoyt","content_html":"<p>In language, semantics describe the names and meanings of words. The\nbioinformatics community has aptly adopted <em>biosemantics</em> as a concept that\nencompasses the issues with the names and meanings of biological entities,\nusually in natural language processing and data integration. However, semantics\ndoes not capture the context of words, and <em>biosemantics</em> fails to describe the\nbiological context and complex relationships between biological entities.</p>\n<p><img alt=\"Semantics versus Pragmatics\" height=\"300px\" src=\"https://pediaa.com/wp-content/uploads/2018/08/Difference-Between-Semantics-and-Pragmatics_Figure-1.png\"/></p>\n<p>Pragmatics goes beyond semantics and describes the context of words. 
Because of\nthis parallelism, I've begun to use the term <em>biopragmatics</em> to describe the\nfamily of computational approaches aimed at identifying and contextualizing the\ncontext of biological entities.</p>","doi":"https://doi.org/10.59350/j4vty-rrr02","guid":"https://cthoyt.com/2020/01/22/biosemantics-versus-biopragmatics","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1579651200,"rid":"6qadr-jvy42","summary":"In language, semantics describe the names and meanings of words. The bioinformatics community has aptly adopted biosemantics as a concept that encompasses the issues with the names and meanings of biological entities, usually in natural language processing and data integration. However, semantics does not capture the context of words, and biosemantics fails to describe the biological context and complex relationships between biological entities.","tags":["Semantics","Meta"],"title":"Biosemantics vs. Biopragmatics","updated_at":1781539932,"url":"https://cthoyt.com/2020/01/22/biosemantics-versus-biopragmatics.html","version":"v1"},{"authors":[{"contributor_roles":[],"family":"Tapley Hoyt","given":"Charles","url":"https://orcid.org/0000-0003-4423-4370"}],"blog":{"authors":null,"community_id":"da4ef2af-5fad-46b2-8195-d77db0141ad6","created":1716422400,"current_feed_url":null,"description":"Unraveling complex biology with biological knowledge graphs. 
Content licensed under CC BY 4.0.","favicon":"https://rogue-scholar.org/api/communities/da4ef2af-5fad-46b2-8195-d77db0141ad6/logo","feed_format":"application/atom+xml","feed_url":"https://cthoyt.com/feed.xml","filter":null,"generator":"Jekyll","home_page_url":"https://cthoyt.com/","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"cthoyt","status":"active","subfield":"1312","title":"Biopragmatics","updated":1781539024,"use_api":null},"blog_name":"Biopragmatics","blog_slug":"cthoyt","content_html":"<p>How many molecular biology papers have you read today? This week? This month? If\nyou're like me, its not so many, and we're falling behind very quickly. Here's a\nchart made by the <em>new</em> PubMed that summarizes how many papers were published\nmentioning RAS in the last 50 years.</p>\n<p><img alt=\"RAS Histogram\" src=\"https://cthoyt.com/img/ras_pubmed_history.png\"/></p>\n<p>There were 4,483 publications listed in 2019. We can't read that much, and even\nif we did, we couldn't remember it all. That's why we need to take the knowledge\nout of the unstructured text and store it in a structured form that can be read\nand stored in computers. This way, we can easily share it, query it, and write\nalgorithms that can help us reason about the incredible amount of biological\nknowledge out there.</p>\n<p>There are several formats in which this kind of information can be stored on a\ncontinuum between directly representing mechanistic biology to representing the\nknowledge itself. 
In the popular middle ground are BioPAX and BEL, which I'll\ncome back to in future posts.</p>\n<p>It's important to keep in mind that knowledge needs to be curated - this can\neither be manual, through natural language processing, or a mixture of both.\nI've written\n<a href=\"https://academic.oup.com/database/article/doi/10.1093/database/baz068/5521414\">a paper</a>\non such a process, but for now this post should motivate a few following ones\ndescribing what it takes to deal with nomenclature, build ontologies, and then\nstart extracting mechanistic biology from the literature.</p>","doi":"https://doi.org/10.59350/jt101-b1374","guid":"https://cthoyt.com/2020/01/23/encoding-biology-in-kgs","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1579737600,"rid":"qyhqj-cs605","summary":"How many molecular biology papers have you read today? This week? This month? If you're like me, its not so many, and we're falling behind very quickly. Here's a chart made by the new PubMed that summarizes how many papers were published mentioning RAS in the last 50 years.","tags":["Knowledge Graphs"],"title":"Encoding Biology in Knowledge Graphs","updated_at":1781539915,"url":"https://cthoyt.com/2020/01/23/encoding-biology-in-kgs.html","version":"v1"},{"authors":[{"contributor_roles":[],"family":"Tapley Hoyt","given":"Charles","url":"https://orcid.org/0000-0003-4423-4370"}],"blog":{"authors":null,"community_id":"da4ef2af-5fad-46b2-8195-d77db0141ad6","created":1716422400,"current_feed_url":null,"description":"Unraveling complex biology with biological knowledge graphs. 
Content licensed under CC BY 4.0.","favicon":"https://rogue-scholar.org/api/communities/da4ef2af-5fad-46b2-8195-d77db0141ad6/logo","feed_format":"application/atom+xml","feed_url":"https://cthoyt.com/feed.xml","filter":null,"generator":"Jekyll","home_page_url":"https://cthoyt.com/","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"cthoyt","status":"active","subfield":"1312","title":"Biopragmatics","updated":1781539024,"use_api":null},"blog_name":"Biopragmatics","blog_slug":"cthoyt","content_html":"<p>The other day I saw a tweet lamenting the drag that is literature review during\npreparation for writing your thesis.</p>\n<blockquote class=\"twitter-tweet\" data-partner=\"tweetdeck\"><p dir=\"ltr\" lang=\"en\">I just love writing 15 page literature reviews for graduate school courses on literally any topic except my thesis topic.</p>\u2014 PhD Diaries (@thoughtsofaphd) <a href=\"https://twitter.com/thoughtsofaphd/status/1225762592045649920?ref_src=twsrc%5Etfw\">February 7, 2020</a></blockquote>\n<p>I agree. I felt the same pain last fall when I wrote\n<a href=\"https://github.com/cthoyt/doctoral-thesis\">my doctoral thesis</a>. Luckily, I had\na strategy that made it a bit easier.</p>\n<p>I learned it from one of my professors when I was doing my master's degree in\nLife Science Informatics. Each semester, we had a seminar course in which each\nstudent was assigned research articles to read and present to the class with a\nshort slide deck. Later, I joined his research group and realized that this\ncourse served as a literature review for him just as much as us.</p>\n<p>So later when I was a Ph.D. student, I volunteered to run the seminar. I\nco-opted the concept, and planned the course to cover many of the topics I found\ninteresting for my thesis. I already knew some of the papers very well, and a\nfew were ones I had always been meaning to read. 
I tried to pick the most recent\npapers for topics when possible, but also threw in a few classics as well.</p>\n<p>On the first day of the seminar, I shared the following course information. I\nthought it was important to make clear what my expectations were for students in\nterms of their prior knowledge. Since they all came from the same master's\nprogram, I thought it was enough that they had passed one of the first semester\nlectures called \"Biological Databases\" which was about many of the resources and\ndatabases used in the systems and networks biology community. I also outlined\nwhat was the content for the course, what was expected, etc. then shared this\nall as a Google Doc so they could read it over and add comments.</p>\n<p>I also made a list of possible papers and a tentative schedule that students\ncould look over and decide which papers they found most interesting. The topics\nwere arranged in a logical order to tell the story of my thesis, and for each\nsection there were a few papers that I thought were very important, and a few\nextras just in case there was a lot of interest. During the first day of the\nseminar, I also went through the list of all papers and explained the topics to\nthe students. I gave them this list via Google Docs as well, and they were able\nto claim papers for their presentations. Below, I've listed the final list of\npapers and the order in which they were presented. We were able to come to\nagreements for all students to present the papers I found most important. Maybe\n40% of the class found a paper interesting and picked one the first day and the\nrest took the next week to decide, ask questions, or propose new papers.</p>\n<p>Another consideration I had when picking this paper list was to choose work done\nby my colleagues that I found interesting and helpful. After, I invited them to\ncome listen to the seminar and mediate discussion after. 
We were able to invite\none of my collaborators Mehdi Ali (he's a really good guy!) to discuss his work\non using deep learning for relation extraction in natural language processing. I\nthink that might have been the most engaging day of the whole series.</p>\n<p>I added one aspect to this course compared to the previous seminar that I had\nattended: each student was not only responsible for presenting the paper that\nhad been assigned from my list, but they were also responsible for finding a\nrelevant pre-print (in the same or similar topic) and submitting a peer review\nthrough the pre-print system. When I was a student, I noticed many students did\nnot read the references of the paper they were assigned in our seminars, and\nalso had not considered other similar research to their paper. Asking them to\nfind their own papers was a way to make this a more creative and fun process,\nand would directly prepare them to answer questions at the end of the\npresentation like \"what will the authors do next?\" or \"how will this research be\nused by others?\"</p>\n<p>One of the funny things that happened during the pre-print presentations is the\nstudents found several of mine and presented those. I suppose this was\ninevitable given the contemporary nature of my work in the context of the topics\nchosen. I would actually explicitly encourage students to check out my\npre-prints the next time I host a seminar, because I know the work very well and\ncould mediate a nice discussion.</p>\n<p>I learned a lot through the process of preparing this seminar. Its outline\nbecame the outline for my thesis, and a lot of the discussions became points\nthat I addressed explicitly in my writing. I wouldn't say that I was taking\nadvantage of the students in this process - we all benefited from the\nexperience. 
I hope you get some ideas about how you might be able to do this\nyourself, whether you're a doctoral student, a postdoc, or something else!</p>\n<h2 id=\"course-information\">Course Information</h2>\n<ul>\n<li>Title: Knowledge Assembly, Data Integration, and Modeling in Systems and\nNetworks Biology</li>\n<li>Period: Winter Semester 2018/2019</li>\n<li>Location: Endenicher Allee 19A, Room U.105 on Wednesdays 13.00-14.30</li>\n</ul>\n<h3 id=\"qualifications\">Qualifications</h3>\n<p>Students should be comfortable with the material presented in the Biological\nDatabases lecture during the first semester of the LSI curriculum.</p>\n<h3 id=\"goal\">Goal</h3>\n<p>Students will have the opportunity to practice reading, presenting, and\ndiscussing recent biomedical literature on the topics of knowledge assembly,\ndata integration, and modeling in systems and networks biology.</p>\n<h3 id=\"content\">Content</h3>\n<p>Students will be assigned papers and present on the holistic process of\nknowledge discovery in systems and networks biology that focus on the topics of\nknowledge assembly (e.g., natural language processing, modeling formalisms and\nformats, reasoning techniques), data integration (e.g., practical scenarios\nfocusing on techniques on the data level, knowledge level, and analytical\nlevels), and modeling strategies (e.g., rule-based modeling, agent-based\nmodeling, mathematical modeling, hypothesis generation with knowledge-based\napproaches).</p>\n<h3 id=\"assignment\">Assignment</h3>\n<p>Students will be assigned an article to read and present during a thirty (30)\nminute lecture. One goal of this lecture is to show an understanding of not only\nthe material presented in the article, but also the relevant background\ninformation - this may entail following the references and reading other\narticles. 
Another goal is to not only educate, but entertain the audience.\nStudents will also be expected to find a relevant pre-print article on arXiv,\nbioRxiv, or other pre-print server and post a peer-review for the author on the\ncorresponding service. Following the presentation of their assigned article,\nstudents should include slides (1-3) briefly explaining the relevance of the\npre-print that they found.</p>\n<h2 id=\"method-of-performance-review\">Method of Performance Review</h2>\n<p>Students will be assessed on the understanding of their assigned topic, the\nquality of their presentation, and their participation. Students missing more\nthan 2 seminars will not pass the course without a doctor's note.</p>\n<h2 id=\"schedule\">Schedule</h2>\n<h3 id=\"week-0---october-10th-2018---syllabus-week\">Week 0 - October 10th, 2018 - Syllabus Week</h3>\n<p>This week there will be a short discussion of the syllabus and no presentation. For\nthose in Bonn that aren't aware of this wonderful tradition, welcome to Syllabus\nWeek.</p>\n<h3 id=\"week-1---october-31st-2018---named-entity-recognition\">Week 1 - October 31st, 2018 - Named Entity Recognition</h3>\n<p>Mubassher Leser, U., &amp; Hakenberg, J. (2005).\n<a href=\"https://doi.org/10.1093/bib/6.4.357\">What makes a gene name? Named entity recognition in the biomedical literature</a>.\n<em>Briefings in Bioinformatics</em>, 6(4), 357\u2013369.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2017/03/08/115022</p>\n<p>Bachman, J. A., Gyori, B. M., &amp; Sorger, P. K. 
(2018).\n<a href=\"https://doi.org/10.1186/s12859-018-2211-5\">FamPlex: A resource for entity recognition and relationship resolution of human protein families and complexes in biomedical text mining</a>.\n<em>BMC Bioinformatics</em>, 19(1), 1\u201314.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/07/29/379446</p>\n<h3 id=\"week-2---november-7th-2018---identifiers\">Week 2 - November 7th, 2018 - Identifiers</h3>\n<p>Laibe, C., &amp; Le Nov\u00e8re, N. (2007).\n<a href=\"https://doi.org/10.1186/1752-0509-1-58\">MIRIAM Resources: tools to generate and resolve robust cross-references in Systems Biology</a>.\n<em>BMC Systems Biology</em>, 1, 58.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2017/07/24/167619</p>\n<p>Juty, N., Le Nov\u00e8re, N., &amp; Laibe, C. (2012).\n<a href=\"https://doi.org/10.1093/nar/gkr1097\">Identifiers.org and MIRIAM Registry: Community resources to provide persistent identification</a>.\n<em>Nucleic Acids Research</em>, 40(D1), 580\u2013586.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/02/14/101279</p>\n<h3 id=\"week-3---november-14th-2018---information-extraction\">Week 3 - November 14th, 2018 - Information Extraction</h3>\n<p>Novichkova, S., <em>et al.</em> (2003).\n<a href=\"https://doi.org/10.1093/bioinformatics/btg207\">MedScan, a natural language processing engine for MEDLINE abstracts</a>.\n<em>Bioinformatics</em>, 19(13), 1699\u20131706.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/08/29/403667</p>\n<p>Ali, M., <em>et al.</em> (2017).\n<a href=\"http://publica.fraunhofer.de/eprints/urn_nbn_de_0011-n-4972978.pdf\">Automatic Extraction of BEL-Statements based on Neural Networks</a>.\n<em>Proceedings of BioCreative VI Challenge and Workshop</em>, (October).</p>\n<p>Pre-print: https://osf.io/j76y3/</p>\n<h3 id=\"week-4---november-21nd-2018---knowledge-representations\">Week 4 - November 21st, 2018 - Knowledge Representations</h3>\n<p>Demir, E., <em>et al.</em> (2010).\n<a 
href=\"https://doi.org/10.1038/nbt1210-1308c\">The BioPAX community standard for pathway data sharing</a>.\n<em>Nature Biotechnology</em>, 28(12), 1308\u20131308.</p>\n<p>Pre-print: https://www.biorxiv.org/content/10.1101/192856v1</p>\n<p>Hucka, M., <em>et al.</em> (2003).\n<a href=\"http://www.ncbi.nlm.nih.gov/pubmed/12611808\">The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models</a>.\n<em>Bioinformatics (Oxford, England)</em>, 19(4), 524\u201331.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/01/23/246470</p>\n<h3 id=\"week-5---november-28th---knowledge-representations-cont\">Week 5 - November 28th - Knowledge Representations (cont\u2026)</h3>\n<p>Le Nov\u00e8re, <em>et al.</em> (2009).\n<a href=\"https://doi.org/10.1038/nbt.1558\">The Systems Biology Graphical Notation</a>.\n<em>Nature Biotechnology</em>, 27(8), 735\u201341.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/01/30/256750</p>\n<p>Carbon, S., <em>et al.</em> (2017).\n<a href=\"https://doi.org/10.1093/nar/gkw1108\">Expansion of the gene ontology knowledgebase and resources: The gene ontology consortium</a>.\n<em>Nucleic Acids Research</em>, 45(D1), D331\u2013D338.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/10/07/437020</p>\n<h3 id=\"week-6---december-12th-2018---pathway-databases-and-semantic-data-integration\">Week 6 - December 12th, 2018 - Pathway Databases and Semantic Data Integration</h3>\n<p>Croft, D., <em>et al.</em> (2014).\n<a href=\"https://doi.org/10.1093/nar/gkt1102\">The Reactome pathway knowledgebase</a>.\n<em>Nucleic Acids Research</em>, 42(D1), D472\u2013D477. <strong>AND</strong> Fabregat, A., <em>et al.</em>\n(2018).\n<a href=\"https://doi.org/10.1093/nar/gkx1132\">The Reactome Pathway Knowledgebase</a>.\n<em>Nucleic Acids Research</em>, 46(D1), D649\u2013D655.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/10/09/375097</p>\n<p>Cerami, E. 
G., <em>et al.</em> (2011).\n<a href=\"https://doi.org/10.1093/nar/gkq1039\">Pathway Commons, a web resource for biological pathway data</a>.\n<em>Nucleic Acids Research</em>, 39(SUPPL. 1), 685\u2013690.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/09/03/353235</p>\n<p>Khatri, P., Sirota, M., &amp; Butte, A. J. (2012).\n<a href=\"https://doi.org/10.1371/journal.pcbi.1002375\">Ten years of pathway analysis: Current approaches and outstanding challenges</a>.\n<em>PLoS Computational Biology</em>, 8(2).</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/09/13/416131</p>\n<p>Gligorijevi\u0107, V., &amp; Pr\u017eulj, N. (2015).\n<a href=\"https://doi.org/10.1098/rsif.2015.0571\">Methods for biological data integration: perspectives and challenges</a>.\n<em>Journal of The Royal Society Interface</em>, 12(112), 20150571.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/12/13/358390</p>\n<h3 id=\"week-8---january-16th-2019---applications\">Week 8 - January 16th, 2019 - Applications</h3>\n<p>Saqi, M., <em>et al.</em> (2018).\n<a href=\"https://doi.org/10.1093/bib/bby025\">Navigating the disease landscape: knowledge representations for contextualizing molecular signatures</a>.\n<em>Briefings In Bioinformatics</em>, (May), 1\u201315.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/11/23/475202</p>\n<p>Himmelstein, D. S., <em>et al.</em> (2017).\n<a href=\"https://doi.org/10.7554/eLife.26726\">Systematic integration of biomedical knowledge prioritizes drugs for repurposing</a>.\n<em>ELife</em>, 6.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/10/13/442640</p>\n<h3 id=\"week-9---january-23rd-2019---applications\">Week 9 - January 23rd, 2019 - Applications</h3>\n<p>Lopez, C. 
F., <em>et al.</em> (2013).\n<a href=\"https://doi.org/10.1038/msb.2013.1\">Programming biological models in Python using PySB</a>.\n<em>Molecular Systems Biology</em>, 9(646), 646.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/12/23/503359</p>\n<p>Gyori, B. M., <em>et al.</em> (2017).\n<a href=\"https://doi.org/10.15252/msb.20177651\">From word models to executable models of signaling networks using automated assembly</a>.\n<em>Molecular Systems Biology, 13(11)</em>, 954.</p>\n<p>Pre-print: https://www.biorxiv.org/content/early/2018/05/15/322156</p>","doi":"https://doi.org/10.59350/4t3sf-mab09","guid":"https://cthoyt.com/2020/02/09/seminar-for-thesis-writing","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1581206400,"rid":"9h7sp-08z86","summary":"The other day I saw a tweet lamenting the drag that is literature review during preparation for writing your thesis.","tags":["Doctoral Thesis","Teaching"],"title":"Host a Graduate Seminar Before Writing Your Thesis","updated_at":1781539913,"url":"https://cthoyt.com/2020/02/09/seminar-for-thesis-writing.html","version":"v1"},{"authors":[{"contributor_roles":[],"family":"Tapley Hoyt","given":"Charles","url":"https://orcid.org/0000-0003-4423-4370"}],"blog":{"authors":null,"community_id":"da4ef2af-5fad-46b2-8195-d77db0141ad6","created":1716422400,"current_feed_url":null,"description":"Unraveling complex biology with biological knowledge graphs. 
Content licensed under CC BY 4.0.","favicon":"https://rogue-scholar.org/api/communities/da4ef2af-5fad-46b2-8195-d77db0141ad6/logo","feed_format":"application/atom+xml","feed_url":"https://cthoyt.com/feed.xml","filter":null,"generator":"Jekyll","home_page_url":"https://cthoyt.com/","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"cthoyt","status":"active","subfield":"1312","title":"Biopragmatics","updated":1781539024,"use_api":null},"blog_name":"Biopragmatics","blog_slug":"cthoyt","content_html":"<p>We've all been there. You started a new branch from master. You had a very\nspecific goal in mind, <strong>The Original Goal</strong>. You made a pull request (PR) to go\nwith it, too, <strong>The Original Pull Request</strong>. But then, you had an idea! And\nalso, someone on your team asked you to solve another problem! Now the original\ncode you wrote to address <strong>The Original Goal</strong> relies on that code \u2026 and now\nyou've got dozens of files changed, hundreds of lines of diff, and nobody\n(including you) can understand what you've done. Like I said, we've all been\nthere. Here's what you can do to fix it:</p>\n<h2 id=\"1-stop-and-relax\">1. Stop and Relax</h2>\n<p>Don't do anything rash. Git is a pain to use, and you're going to have to rely\non it to keep a history for you of what you've done.</p>\n<h2 id=\"2-summarize\">2. Summarize</h2>\n<p>First, you're going to have to take a big step back. Write a summary of all the\nthings you've done in <strong>The Original Pull Request</strong>. This should be about <em>what</em>\nthe PR does and <em>why</em> it does it. 
Of course it could vary depending on the\nsituation, but this summary shouldn't be about exactly how the PR does it,\nbecause the implementation details are likely what led to this situation in the\nfirst place.</p>\n<p>Keep in mind that every PR has a box at the top that's used to describe what's\nin it. This is where you will put your summary.</p>\n<h2 id=\"3-assessing-dependencies\">3. Assessing Dependencies</h2>\n<p>Of all the things that <strong>The Original Pull Request</strong> does, some of them are\nself-contained, and some of them rely on each other. It was probably the case\nthat to accomplish <strong>The Original Goal</strong>, you had to address lots of smaller\ngoals. You probably also had to change lots of code and write new code too.</p>\n<p>Wouldn't it have been nice if all of these implementations were already done,\nbecause then you could have just solved <strong>The Original Goal</strong> directly by\nusing/applying previous code. That's what we're going to aim for.</p>\n<p>But first, you need to figure out which of the things you did rely on which other\nones, because you're going to break <strong>The Original Pull Request</strong> up until it\nexactly matches up to addressing <strong>The Original Goal</strong>.</p>\n<h2 id=\"4-the-break-up\">4. The Break Up</h2>\n<p>After you understand which parts of <strong>The Original Pull Request</strong> depend on each\nother, pick one independent part of the code that accomplishes one sub-goal.\nSince you're not doing this to be a martyr, and we all know git is too\ncomplicated to <em>Do It Right</em>, you're going to copy/paste the files that are\nrelated to this change to your desktop*.</p>\n<h2 id=\"5-escape-the-madness\">5. Escape the Madness</h2>\n<p>Before continuing, you're going to make sure all of the code in your big messy\nbranch for <strong>The Original Pull Request</strong> is committed and pushed. 
Even though we\nwant to supersede what's there, it never hurts to keep track of your descent\ninto madness.</p>\n<p>After there's nothing lying around, switch back to master. If your team has\ntaken good care of your repository, the master branch should be undisturbed by\nthe chaos you've created in <strong>The Original Pull Request</strong>. Make a new branch\nfrom master, and name it appropriately for fixing the one sub-goal, from here\nout known as <strong>The Sub-Goal</strong> that you identified in Step 4. Now you can start\nupdating the relevant files in your repository based on the files you copied to\nyour desktop. I suggest you don't copy/paste the contents of the whole files,\nbecause you might have forgotten about something else you changed in them. After\nall, you're reading my guide because this was a mess.</p>\n<h2 id=\"6-the-new-pull-request\">6. The New Pull Request</h2>\n<p>Once you've finished making the new branch for your independent part of code\nthat solves <strong>The Sub-Goal</strong>, you can make <strong>The New Pull Request</strong>.</p>\n<p>You will now go through the entire process of writing a good summary of this\nbranch for your co-developers, you will get their feedback, you will make\nupdates, pass flake8, and so on. They will thank you for having code that\naccomplishes one thing, and can be described simply. They will thank you for not\nhaving too big of a diff, and for the things in the diff all being relevant and\nimportant. Then you can merge this branch into master.</p>\n<h2 id=\"7-newfound-wisdom\">7. Newfound Wisdom</h2>\n<p>Throughout transferring the code for <strong>The New Pull Request</strong> you have probably\nrealized there are some things you did back in <strong>The Original Pull Request</strong>\nthat you could do better, and made some updates in the code in <strong>The New Pull\nRequest</strong> to reflect the wisdom you've gained along the way. 
That's great!\nCongratulations!</p>\n<p>After your team has approved <strong>The New Pull Request</strong>, you can merge it into\nmaster and delete the branch both locally and on the remote. Then you should\nswitch back to the master branch. You can pull from master, and see your code\nthat solved <strong>The Sub-Goal</strong> reflected here.</p>\n<h2 id=\"8-the-hard-part\">8. The Hard Part</h2>\n<p>This is the hard part. Now you have to switch back to the branch for <strong>The\nOriginal Pull Request</strong>. Now you have to update this branch from master. It's\ngoing to be hard because now you've probably made different changes in <strong>The New\nPull Request</strong> than in <strong>The Original Pull Request</strong> so there will likely be\nconflicts.</p>\n<p>This is not a tutorial on how to solve merge conflicts. Use Google to figure\nthat out.</p>\n<p>I can't overstate this: <strong>do this part really well</strong>. If you don't, then the history\nin the original branch will be even more incomprehensible, and you won't be able\nto tell if you lost any of your original work. Please, please, please do this\nwell.</p>\n<p>P.S. Like I said before, don't be a martyr. Use tools like GitHub Desktop and\nPyCharm to help you merge. I heard that the git CLI was <em>allegedly</em> created by\nLinus Torvalds to slow other developers down.</p>\n<p>Why are we going through all of this pain, rather than just pushing your team to\nlet you merge <strong>The Original Pull Request</strong>? You have to do this\nbecause now all of the changes that addressed <strong>The Sub-Goal</strong> are part of\nmaster, and are no longer part of the diff of <strong>The Original Pull Request</strong>.</p>\n<p>Now you're one step closer to your team being able to understand, review, and\neventually merge <strong>The Original Pull Request</strong>.</p>\n<h2 id=\"9-the-frustrating-part\">9. The Frustrating Part</h2>\n<p>This is the frustrating part. 
After you've gone through all of that work to\nsplit a tiny part of <strong>The Original Pull Request</strong> into a smaller, independent\npull request, you're not done. You will probably have to repeat steps 4-8 a few\ntimes. You'll be tempted to throw away the branch for <strong>The Original Pull\nRequest</strong> and maybe start over.</p>\n<p>Don't do that.</p>\n<p>If you do, the same disorganization that led to the mess of <strong>The Original Pull\nRequest</strong> might just slip back into whatever you do next. Even worse, nobody\nelse will be able to follow what you've done until now.</p>\n<p>So relax. This is going to take a few days. You're going to have to wait in\nbetween several iterations for feedback. That's good. You need feedback. I need\nfeedback. We all need to practice getting it and giving it. Embrace the\nopportunity to have your team help you improve your code, gain wisdom, and make\nyour contributions sustainable.</p>\n<h2 id=\"finishing-up\">Finishing Up</h2>\n<p>Eventually, after several iterations of steps 4-9, you will have excised all of the\ncode that was important for <strong>The Original Pull Request</strong>, but not directly\naccomplishing <strong>The Original Goal</strong>. As you removed independent parts, new parts\nbecame independent themselves. Eventually, <strong>The Original Pull Request</strong> will\nindeed match up exactly to <strong>The Original Goal</strong>, and then you will be able to come\nback to it for review and merging.</p>\n<p>I understand this is a frustrating process. The purpose of these steps was to\nhelp you think through a large piece of work you've done. You should be proud\nthat you've solved a complex problem with many intricate parts. 
It was a lot of\nextra work to break it into many pull requests, and it might have taken more of\nyour time the first time working through this process, but in the future, this\nmight help you to start with small tasks rather than addressing <strong>The Original\nGoal</strong> all at once. GitHub, for example, has an issue tracker that is very\nhelpful for this. I imagine that each issue should correspond to a <strong>Sub-Goal</strong>,\nand that each should have exactly one PR that addresses it. <strong>The Original\nGoal</strong> also deserves its own issue that points to all of the issues for its\nsub-goals. Eventually you will address this with a beautiful PR as well. Happy\ncoding!</p>\n<p>*If you're thinking, why don't I use cherry picking? If you know what cherry\npicking is in the context of git (and also how to use it) then you probably\nwon't have the issue that prompted this blog post. But also, you should go\noutside and pick some apples instead. Thanksgiving is never more than a few\nhundred days away. It pays to be ready.</p>\n<h2 id=\"afterword\">Afterword</h2>\n<p>It might be illustrative to see an example of where this was done in\npractice, so I'll share some work I did with a text mining tool from Harvard\nMedical School, <a href=\"https://github.com/indralab/gilda\">Gilda</a>. It's a simple yet\npowerful system for grounding named entities based on dictionary lookup.\nUnfortunately, it didn't include some dictionaries I wanted, and it didn't have\na UI to go with its web API.</p>\n<p>So I set out to figure out how it generated dictionaries, where it stored\nthem, and how it loaded them to make the web app. I ended up making several\nmodifications to accomplish this goal, but it was a huge PR. 
I've definitely\nannoyed the author, <a href=\"https://github.com/bgyori\">@bgyori</a>, with PRs that are too\nbig before, which he was ultimately not able to understand or merge.</p>\n<p>Keep in mind, in your team, your teammates might be obligated to help you\nbecause you're working towards a common goal, getting paid, etc. When you're in\nthe open source world, nobody really owes you anything, so it's in your best\ninterest to make things as easy as possible on the package's maintainer(s).</p>\n<p>So I made a few different pull requests that were all totally independent:</p>\n<ul>\n<li>Add constants for resource file paths\n<a href=\"https://github.com/indralab/gilda/pull/12\">#12</a></li>\n<li>Make API more reusable <a href=\"https://github.com/indralab/gilda/pull/13\">#13</a></li>\n<li>Make instantiation of Grounder more flexible\n<a href=\"https://github.com/indralab/gilda/pull/15\">#15</a></li>\n</ul>\n<p>Maybe you're seeing a theme here. I was improving lots of different bits of\nGilda so I could reuse the package in new code later. The next incremental\nincrease was:</p>\n<ul>\n<li>Refactor functionality from the GrounderInstance class into the Grounder class\n<a href=\"https://github.com/indralab/gilda/pull/16\">#16</a></li>\n</ul>\n<p>And finally with these in place, I realized that adding a web interface was\nparallel to my original goal, but not the core. What was really important was\nthat throughout all of the Gilda functionality, I could load my own synonym list\n(which I'd generate using the HPO, EFO, and DOID). I was able to address the UI\nwith:</p>\n<ul>\n<li>Add minimimal UI to web interface\n<a href=\"https://github.com/indralab/gilda/pull/19\">#19</a></li>\n</ul>\n<p>At the time of writing, we're still working through this PR. 
But all of it is\nleading up to the point where I can load my own files into this web interface.\nIt will seem so obvious to Ben when I send this PR next (but after giving him\nsome space\u2026 I did just bombard him with 5 PRs in a few days) what I am trying\nto accomplish and why.</p>\n<p>Want to see what happens when you try and do all of this in one PR? You will\ncorrectly guess that the PR is a total mess, impossible to understand, and\nriddled with questions that are really too big to answer when your head is\nalready so far in the sand. Behold, in all its infamy, my failed PR from last\nsummer (<a href=\"https://github.com/indralab/gilda/pull/4\">#4</a>). At this point, you\ncan't even see what a mess it was from the linked web page but if you go back\nthrough the version history before I broke it into 5 smaller PRs (using the\nworkflow described above) it was a monolith.</p>","doi":"https://doi.org/10.59350/3cn95-h9y94","guid":"https://cthoyt.com/2020/03/20/how-to-fix-your-monolithic-pull-request","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1584662400,"rid":"w8wqq-y9x71","summary":"We've all been there. You started a new branch from master. You had a very specific goal in mind, The Original Goal. You made a pull request (PR) to go with it, too, The Original Pull Request. But then, you had an idea! And also, someone on your team asked you to solve another problem!","tags":["Code With Me"],"title":"How to Fix Your Monolithic Pull Request","updated_at":1781539911,"url":"https://cthoyt.com/2020/03/20/how-to-fix-your-monolithic-pull-request.html","version":"v1"},{"authors":[{"contributor_roles":[],"family":"Tapley Hoyt","given":"Charles","url":"https://orcid.org/0000-0003-4423-4370"}],"blog":{"authors":null,"community_id":"da4ef2af-5fad-46b2-8195-d77db0141ad6","created":1716422400,"current_feed_url":null,"description":"Unraveling complex biology with biological knowledge graphs. 
Content licensed under CC BY 4.0.","favicon":"https://rogue-scholar.org/api/communities/da4ef2af-5fad-46b2-8195-d77db0141ad6/logo","feed_format":"application/atom+xml","feed_url":"https://cthoyt.com/feed.xml","filter":null,"generator":"Jekyll","home_page_url":"https://cthoyt.com/","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"cthoyt","status":"active","subfield":"1312","title":"Biopragmatics","updated":1781539024,"use_api":null},"blog_name":"Biopragmatics","blog_slug":"cthoyt","content_html":"<p>A few months ago, the question was posed on science Twitter: \"How many people\nhave published on <a href=\"https://chemrxiv.org/\">ChemRxiv</a>?\"</p>\n<blockquote class=\"twitter-tweet\" data-partner=\"tweetdeck\"><p dir=\"ltr\" lang=\"en\">makes me wonder about the stats at <a href=\"https://twitter.com/ChemRxiv?ref_src=twsrc%5Etfw\">@ChemRxiv</a> <a href=\"https://t.co/Ml5X8F4ckJ\">https://t.co/Ml5X8F4ckJ</a></p>\u2014 Egon Willigh\u24d0gen (@egonwillighagen) <a href=\"https://twitter.com/egonwillighagen/status/1219193083792969728?ref_src=twsrc%5Etfw\">January 20, 2020</a></blockquote>\n<p>It was a good day for me, which meant I was in the mood to take up the first\nchallenge posed on Twitter. I found that Fran\u00e7ois-Xavier Coudert\n(<a href=\"https://github.com/fxcoudert\">@fxcoudert</a>) had previously written a\n<a href=\"https://github.com/fxcoudert/tools/blob/master/chemRxiv/chemRxiv.py\">Python client</a>\nfor ChemRxiv. I made a pair of pull requests\n(<a href=\"https://github.com/fxcoudert/tools/pull/9\">fxcoudert/tools#9</a> and\n<a href=\"https://github.com/fxcoudert/tools/pull/10\">fxcoudert/tools#10</a>) to fix some\nbugs and make it importable from other Python modules.</p>\n<p>Unlike BioRxiv, the pre-print server for biology, ChemRxiv is implemented with\n<a href=\"https://figshare.com/\">FigShare</a>. 
It turns out that all FigShare \"institutions\"\nlike ChemRxiv are actually accessible through the main\n<a href=\"https://docs.figshare.com/\">FigShare API</a>. I think this is pretty cool, and\nmade sure that the ChemRxiv client that I had updated was actually able to be\nrun for any institution. Fun fact: the institution code for ChemRxiv is <code class=\"language-plaintext highlighter-rouge\">259</code>.</p>\n<p>I got to work writing my\n<a href=\"https://github.com/cthoyt/chemrxiv-summarize\">own repository</a> to wrap the\nclient, take care of downloading all of the bibliographic information available,\nand generating some pretty pictures. I originally ran the scripts and generated\npictures on January 20th, 2020 (the day Egon posed the question). Since the\npandemic has got the whole science community introspecting, I came back to this\ntoday and thought it might be worth writing up as a blog post.</p>\n<p>Without further ado, here are the most recent charts I've generated to answer\nthree main questions. I've linked the images in such a way that the charts will\nbe automatically updated with my GitHub repository. This also implicitly means\nthat there's a history of each image, but because two of them are plotting time\ncourse information, the history is already conveyed within the chart.</p>\n<h3 id=\"how-many-articles-were-contributed-each-month-to-chemrxiv\">How many articles were contributed each month to ChemRxiv?</h3>\n<p>How many papers were submitted each month to ChemRxiv? 
Keep in mind that the\ncurrent month is likely not complete.</p>\n<p><img alt=\"Articles per Month\" src=\"https://raw.githubusercontent.com/cthoyt/chemrxiv-summarize/master/figshare/chemrxiv/articles_per_month.png\"/></p>\n<h3 id=\"how-many-unique-authors-contribute-each-month-to-chemrxiv\">How many unique authors contribute each month to ChemRxiv?</h3>\n<p>This only counts using the ORCID iDs of the first authors; it's pretty\ninconsistent what other identifying information is included in the metadata for\neach article.</p>\n<p><img alt=\"Unique Authors per Month\" src=\"https://raw.githubusercontent.com/cthoyt/chemrxiv-summarize/master/figshare/chemrxiv/unique_authors_per_month.png\"/></p>\n<h3 id=\"how-many-author-submit-multiple-times-each-month\">How many authors submit multiple times each month?</h3>\n<p>How many authors submitted more than once per month? This chart shows spikes in\nAugust, which I will guess is when most people are submitting before their\nsummer breaks :)</p>\n<p><img alt=\"Percent Duplicate Authors per Month\" src=\"https://raw.githubusercontent.com/cthoyt/chemrxiv-summarize/master/figshare/chemrxiv/percent_duplicate_authors_per_month.png\"/></p>\n<h3 id=\"how-many-authors-submitted-for-their-first-time-each-month\">How many authors submitted for their first time each month?</h3>\n<p><img alt=\"First Time First Authors per Month\" src=\"https://raw.githubusercontent.com/cthoyt/chemrxiv-summarize/master/figshare/chemrxiv/first_time_first_authors_per_month.png\"/></p>\n<h3 id=\"how-many-unique-first-authors-are-there-on-chemrxiv\">How many unique first authors are there on ChemRxiv?</h3>\n<p>How many first authors have historically contributed to ChemRxiv at each month?\nWe can take the first date of authorship for each author then count at each\nmonth how many unique first time authors there are. 
Then, we can use a\ncumulative sum to show how many authors have contributed to ChemRxiv at any\npoint in time.</p>\n<p><img alt=\"Historical Authorship\" src=\"https://raw.githubusercontent.com/cthoyt/chemrxiv-summarize/master/figshare/chemrxiv/historical_authorship.png\"/></p>\n<h3 id=\"how-many-authors-are-prolific-on-chemrxiv\">How many authors are prolific on ChemRxiv?</h3>\n<p>If we aggregate the data, we can ask how many authors have submitted lots of\narticles:</p>\n<p><img alt=\"Author Prolificness\" src=\"https://raw.githubusercontent.com/cthoyt/chemrxiv-summarize/master/figshare/chemrxiv/author_prolificness.png\"/></p>\n<h3 id=\"what-licenses-are-popular-on-chemrxiv\">What licenses are popular on ChemRxiv?</h3>\n<p>The following chart shows the popularity of different licenses over time. The\n<a href=\"https://creativecommons.org/licenses/by-nc-nd/4.0/\">CC BY-NC-ND 4.0 license</a> is\na resounding victor. You can learn about Creative Commons (CC) licenses\n<a href=\"https://creativecommons.org/licenses/\">here</a>.</p>\n<p><img alt=\"Historical Licenses\" src=\"https://raw.githubusercontent.com/cthoyt/chemrxiv-summarize/master/figshare/chemrxiv/historical_licenses.png\"/></p>\n<p>If you're interested to regenerate these charts yourself, you're welcome to do\nso with the following code:</p>\n<div class=\"language-bash highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>git clone https://github.com/cthoyt/chemrxiv-summarize\n<span class=\"nb\">cd </span>chemrxiv-summarize\npython 01_download.py\npython 02_process.py\npython 03_visualize.py\n</code></pre></div></div>\n<p>Downloading takes a bit of time (about 40 minutes) but there's a <code class=\"language-plaintext highlighter-rouge\">tqdm</code> bar to\nkeep you entertained in the mean time. 
Normally I package all of my code, but\nthe one off scripts here didn't seem to warrant it.</p>\n<p>As a final note, I'd like to shout out to Marshall Brennan\n(<a href=\"https://twitter.com/Organometallica\">@Organometallica</a>) for being an excellent\nspokesperson and public face of ChemRxiv. Also, throughout this process I\nrealized he also was a chemistry major in his bachelor's at Northeastern\nUniversity like me. Go huskies!</p>\n<hr/>\n<p>May 2020 Update: Fran\u00e7ois-Xavier Coudert created the\n<a href=\"https://chemrxiv-dashboard.github.io/\">ChemRxiv-Dashboard</a>, which makes some\nsimilar summaries to this. Check it out!</p>\n<blockquote class=\"twitter-tweet\" data-partner=\"tweetdeck\"><p dir=\"ltr\" lang=\"en\">I made a dashboard for <a href=\"https://twitter.com/ChemRxiv?ref_src=twsrc%5Etfw\">@ChemRxiv</a>, fed by the <a href=\"https://twitter.com/figshare?ref_src=twsrc%5Etfw\">@figshare</a><br/>metadata API.<a href=\"https://t.co/rKyAOGkrVO\">https://t.co/rKyAOGkrVO</a> <a href=\"https://t.co/fLfjEabraz\">pic.twitter.com/fLfjEabraz</a></p>\u2014 FX Coudert (@fxcoudert) <a href=\"https://twitter.com/fxcoudert/status/1262763710956793860?ref_src=twsrc%5Etfw\">May 19, 2020</a></blockquote>\n<p>November 2020 Update: I added a license chart and made some changes to enable\nthis repo to be much more easily used for other FigShare institutions. 
If you've\nfound this post from @figshare's\n<a href=\"https://twitter.com/figshare/status/1323762002293121025\">tweet</a> and want help\nmaking these charts for your FigShare institution, please feel free to @ me on\nTwitter or send me an email.</p>","doi":"https://doi.org/10.59350/n7rjr-90f02","guid":"https://cthoyt.com/2020/04/15/summarizing-chemrxiv","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1586908800,"rid":"3ky5b-cne42","summary":"A few months ago, the question was posed on science Twitter: \"How many people have published on ChemRxiv?\"","tags":["Bibliometrics"],"title":"Summarizing ChemRxiv","updated_at":1781539909,"url":"https://cthoyt.com/2020/04/15/summarizing-chemrxiv.html","version":"v1"},{"authors":[{"contributor_roles":[],"family":"Tapley Hoyt","given":"Charles","url":"https://orcid.org/0000-0003-4423-4370"}],"blog":{"authors":null,"community_id":"da4ef2af-5fad-46b2-8195-d77db0141ad6","created":1716422400,"current_feed_url":null,"description":"Unraveling complex biology with biological knowledge graphs. Content licensed under CC BY 4.0.","favicon":"https://rogue-scholar.org/api/communities/da4ef2af-5fad-46b2-8195-d77db0141ad6/logo","feed_format":"application/atom+xml","feed_url":"https://cthoyt.com/feed.xml","filter":null,"generator":"Jekyll","home_page_url":"https://cthoyt.com/","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"cthoyt","status":"active","subfield":"1312","title":"Biopragmatics","updated":1781539024,"use_api":null},"blog_name":"Biopragmatics","blog_slug":"cthoyt","content_html":"<p>We have a big problem in the bioinformatics community with namespaces,\nidentifiers, and names. And nobody's posed the question better than\n<a href=\"https://www.youtube.com/watch?v=U0CGsw6h60k\">Rihanna herself</a>.</p>\n<p>During my Ph.D. 
at Fraunhofer, one of the old text miners reminisced to me about\nthe late 90's and early naughties when they had to curate their own dictionaries\nof synonyms for entities. I was lucky enough to have joined the bioinformatics\ncommunity after excellent nomenclature resources like\n<a href=\"https://www.ebi.ac.uk/chebi/\">ChEBI</a> and the <a href=\"https://www.genenames.org/\">HGNC</a>\nwere established and accepted by the community as gospel.</p>\n<p>I consider these sources excellent because it's quite easy to get a list of the\nidentifiers and corresponding names that they maintain (TSV, etc.). There are\nother nomenclatures, like the\n<a href=\"https://cthoyt.com/2020/04/18/ooh-na-na.html/ftp://ftp.expasy.org/databases/enzyme/enzyme.dat\">ExPASy Enzyme Classes</a>, that\nare stored as text files in non-standard formats.</p>\n<p>The Open Biomedical Ontology (OBO) format and\n<a href=\"http://www.obofoundry.org/\">OBO Foundry</a> were first published in\n<a href=\"https://www.nature.com/articles/nbt1346\">2007</a> as a solution for standardizing\na growing set of biomedical ontologies that shared few semantics. Many ontology\nmaintainers adopted their format, or at least used the OWL to OBO converter\ntools to include their ontologies in a reusable format. However, there remain\nsome notable holdouts like the\n<a href=\"https://github.com/CLO-ontology\">Cell Line Ontology</a> that have not begun to\ndistribute their content as OBO.</p>\n<p>In parallel, the <a href=\"https://www.ebi.ac.uk/ols\">Ontology Lookup Service (OLS)</a> was\npublished as one of many front-ends for exploring this growing list of\nresources. 
In comparison, it may have been one of the first tools to provide a\nnice user experience that included a search engine (powered by\n<a href=\"http://www.obofoundry.org/\">solr</a>, because they're living in the Java world).</p>\n<p>Both are lacking - there does not exist a solid OBO ecosystem (though Martin\nLarralde's <a href=\"https://github.com/althonos/pronto\">pronto</a> may well soon change\nthat) and even worse, the content in OBO loosely follows the standard, at best.\nOn the other hand, the OLS has an over-engineered interface that isn't\nquite user friendly. For example, if you want to look up programmed cell death\n(GO:0012501), you have to know the internal OLS key for the namespace and the\nPURL for the identifier, which is not so obvious. Then you can finally hit the\n<a href=\"https://www.ebi.ac.uk/ols/api/ontologies/go/terms?iri=http://purl.obolibrary.org/obo/GO_0012501\">API</a>.</p>\n<p>And still, both of them lack some of my favorite, and arguably most important,\nnamespaces, like HGNC, RGD, MGI, UniProt, Entrez Gene, and PubChem. As an aside,\ndealing with PubChem is for people operating on a whole different level, so I'm\nnot blaming anyone for dropping the ball on that one. Later, I will confess to\ndoing the same.</p>\n<p>Even worse, the OBO Foundry and OLS can't even agree on what to call some\nnamespaces. A great example is the NCBI taxonomy database. 
On the NCBI site,\nthey say that the namespace is called <code class=\"language-plaintext highlighter-rouge\">NCBI</code> and compact uniform resource identifiers\n(CURIEs) should look like <code class=\"language-plaintext highlighter-rouge\">NCBI:txid175694</code>, while OBO Foundry says the namespace is\n<code class=\"language-plaintext highlighter-rouge\">NCBITaxon</code> (one of the few notable mixed-case namespace names) and CURIEs\nshould look like <code class=\"language-plaintext highlighter-rouge\">NCBITaxon:175694</code>.</p>\n<p>Identifiers.org came along to solve some of these ambiguities with a curated\ndatabase, but it's missing lots of the things in OBO Foundry and OLS, and it\neven disagrees on others. They call the NCBI taxonomy namespace <code class=\"language-plaintext highlighter-rouge\">taxonomy</code> and\nsay that identifiers should look like <code class=\"language-plaintext highlighter-rouge\">taxonomy:175694</code>. Exhausting!</p>\n<p><img alt=\"Registry Comparison\" src=\"https://cthoyt.com/img/registry_comparison.svg\"/></p>\n<p>One more issue is the GOGO problem. Many OBO ontologies use local identifiers\nthat also include the prefix because a given ontology might contain terms\nimported from other ones. However, this means that ontologies that originated\nfrom the OBO world have redundant identifiers, like from GO (e.g.,\nGO:GO:0012501). I know what you're wondering: is Dr. Claw in charge? Maybe.</p>\n<hr/>\n<p>The reason I went down this rabbit hole is that I want to help people\ndo better curation. This means I want them to use identifiers instead of ever-changing\nnames. 
For example, it turns out the half life of an HGNC gene symbol\nis very short -\n<a href=\"https://github.com/bio2bel/bio2bel-notebooks/blob/master/gene_symbol_half_life.ipynb\">thousands of them change every year</a>.\nHowever, if I want people to use identifiers instead of names in their\ndatabases, their papers, and other writing, there need to be really good tools\nfor looking up the names that go with each identifier and the cross-references\n(equivalences) to other databases that are talking about the same thing.</p>\n<p>So I built <a href=\"https://github.com/pyobo/pyobo\">PyOBO</a>. It includes tools for\nreading the OBO Foundry and getting all of the OBO resources that are available\n(as well as <em>many</em> manual fixes for incorrect metadata), it uses Daniel\nHimmelstein's <a href=\"https://github.com/dhimmel/obonet/\">Obonet</a> for parsing and\nstoring pre-parsed files for fast loading, and it applies a swath of rule-based\nnormalization that I've\n<a href=\"https://github.com/pyobo/pyobo/blob/master/src/pyobo/registries/metaregistry.json\">manually curated</a>\nby personally reading all of the OBO files, their identifiers, their\ncross-references, relationships, properties, and everything else. When it comes\nto data, there really is no way around getting your hands dirty.</p>\n<p>I also went ahead and\n<a href=\"https://github.com/pyobo/pyobo/tree/master/src/pyobo/sources\">wrote parsers and converters</a>\nfor lots of other databases like Entrez, ComplexPortal, InterPro, and others so\nthey could play nice with the rest of the ecosystem. Of course, this is an\nongoing process. There are always more databases to include, and when it comes\nto super-sized ones like PubChem, the paradigms I used might not hold up anymore\n(though I did write a parser/converter for it and you're welcome to use it).</p>\n<p>After this long journey of a blog post, I think we're ready to address Rihanna's\nperennial question: what's my name? 
Until now, there really didn't exist a\nservice that let you look up the name for an entity by its CURIE. The link I\ngave for the OLS is the closest I have found, and that just doesn't cut it.</p>\n<p>After all of this coding, I wrote a script (just run <code class=\"language-plaintext highlighter-rouge\">obo ooh-na-na</code>) that takes\nall of the available sources, normalizes their namespaces, normalizes their\nidentifiers, and dumps them as a big 'ol TSV file. 3 columns - namespace,\nidentifier, and name. No nonsense. Probably legal! Get it at\n<a href=\"https://doi.org/10.5281/zenodo.3756206\"><img alt=\"DOI\" src=\"https://zenodo.org/badge/DOI/10.5281/zenodo.3756206.svg\"/></a>.\nI'll make updates periodically as I add more sources, such as if/when I feel\ncomfortable with including the PubChem dump - the\n<a href=\"https://cthoyt.com/2020/04/18/ooh-na-na.html/ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Monthly/2020-04-01/Extras/CID-Title.gz\">CID-Title.gz</a>\nfile is about 1.3 gigabytes, which means this will significantly increase the\nsize, but not so much that it's unreasonable.</p>\n<p>I can imagine that most people probably won't want to download this file, or\nload it in memory (un-gzipped) every time they want to use it. 
I wrote a simple\nweb service that wraps this dataset\n<a href=\"https://github.com/pyobo/pyobo/blob/master/src/pyobo/apps/resolver.py\">included in PyOBO</a>.\nIt should be as easy as running with the shell with\n<code class=\"language-plaintext highlighter-rouge\">python -m pyobo.apps.resolver</code> then running the following python code:</p>\n<div class=\"language-python highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"kn\">import</span> <span class=\"nn\">requests</span>\n\n<span class=\"c1\"># This is an exact match\n</span><span class=\"n\">successful_request</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/resolve/DOID:14330'</span><span class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span class=\"c1\"># {\"identifier\": \"14330\", \"name\": \"Parkinson's disease\", \"prefix\": \"doid\", \"query\": \"DOID:14330\", \"success\": True}\n</span>\n<span class=\"c1\"># This one remaps the prefix if you get it slightly wrong\n</span><span class=\"n\">successful_remapped_request</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/resolve/DO:14330'</span><span class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span class=\"c1\"># {\"identifier\": \"14330\", \"name\": \"Parkinson's disease\", \"prefix\": \"doid\", \"query\": \"DO:14330\", \"success\": True}\n</span>\n<span class=\"c1\"># This one can't find the identifier.\n</span><span class=\"n\">unsuccessful_request</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/resolve/DO:00000'</span><span 
class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span class=\"c1\"># {\"identifier\": \"00000\", \"message\": \"Could not look up identifier\", \"prefix\": \"doid\", \"query\": \"DO:00000\", \"success\": False}\n</span>\n<span class=\"c1\"># Keep in mind, the point of this service isn't to validate identifiers.\n</span><span class=\"n\">unsuccessful_crazy_request</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/resolve/DO:thisIsNotRightAtAll'</span><span class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span class=\"c1\"># {\"identifier\": \"thisIsNotRightAtAll\", \"message\": \"Could not look up identifier\", \"prefix\": \"doid\", \"query\": \"DO:thisIsNotRightAtAll\", \"success\": False}\n</span>\n<span class=\"c1\"># No mercy for bad prefixes\n</span><span class=\"n\">unsuccessful_prefix_lookup</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/resolve/notanamespace:0000'</span><span class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span class=\"c1\"># {\"message\": \"Could not identify prefix\", \"query\": \"notanamespace:0000\", \"success\": False}\n</span></code></pre></div></div>\n<p>It's especially important that the service normalizes curies first, so both\n<code class=\"language-plaintext highlighter-rouge\">DOID:14330</code>, <code class=\"language-plaintext highlighter-rouge\">doid:14330</code>, and <code class=\"language-plaintext highlighter-rouge\">DO:14330</code> can all be resolved to their name,\n<em>Parkinson's disease</em>. 
Because I did extensive manual curation of namespaces and\ntheir synonyms, <code class=\"language-plaintext highlighter-rouge\">NCBITaxon</code> and <code class=\"language-plaintext highlighter-rouge\">taxonomy</code> are both acceptable as well. However,\nthis service doesn't load from the aforementioned TSV, but rather takes\nadvantage of PyOBO's internal code for looking up mappings. I can imagine lots\nof ways I might re-write this service to directly take advantage of this dump (I\nalso invite you to do the same, however best suits you) such as loading it into\nEdgeDB and auto-generating a GraphQL endpoint.</p>\n<p>The last thing that I'm looking into is getting this service hosted so everyone can\nbenefit from it without doing dev-ops in their own organizations. Then I will\ncontinue to obfuscate all usage and documentation with references to pop\nculture. Enjoy!</p>","doi":"https://doi.org/10.59350/wmj3y-04914","guid":"https://cthoyt.com/2020/04/18/ooh-na-na","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1587168000,"rid":"81da2-mtb29","summary":"We have a big problem in the bioinformatics community with namespaces, identifiers, and names. And nobody's posed the question better than Rihanna herself.","tags":["OBO","Lexica"],"title":"Ooh Na Na, What's My Name?","updated_at":1781539907,"url":"https://cthoyt.com/2020/04/18/ooh-na-na.html","version":"v1"},{"authors":[{"contributor_roles":[],"family":"Tapley Hoyt","given":"Charles","url":"https://orcid.org/0000-0003-4423-4370"}],"blog":{"authors":null,"community_id":"da4ef2af-5fad-46b2-8195-d77db0141ad6","created":1716422400,"current_feed_url":null,"description":"Unraveling complex biology with biological knowledge graphs. 
Content licensed under CC BY 4.0.","favicon":"https://rogue-scholar.org/api/communities/da4ef2af-5fad-46b2-8195-d77db0141ad6/logo","feed_format":"application/atom+xml","feed_url":"https://cthoyt.com/feed.xml","filter":null,"generator":"Jekyll","home_page_url":"https://cthoyt.com/","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"cthoyt","status":"active","subfield":"1312","title":"Biopragmatics","updated":1781539024,"use_api":null},"blog_name":"Biopragmatics","blog_slug":"cthoyt","content_html":"<p>On top of the issue of <a href=\"https://cthoyt.com/2020/04/18/ooh-na-na.html\">resolving identifiers to their\nnames</a>, the bioinformatics community has a\nhard time figuring out when two identifiers from different databases are\nequivalent. You know who else has the same problem? Inspector Javert. Get ready\nfor a <em>Les Miserables</em>-themed post on how to address this long-standing problem.</p>\n<p>I have to start my tale of woes by disclosing my source material. I loved both\nthe 1985 and 1987 recordings from the respective original London and Broadway\ncasts. But, for the purposes of this post, I will assume that you've seen the\nexcellent 2012 film adaptation of Alain Boublil, Jean-Marc Natel, and Herbert\nKretzmer's musical adaptation of Victor Hugo's novel <em>Les Miserables</em> and tell\nthe story through that perspective. I also want you to know that I enjoyed\nRussell Crowe's Inspector Javert very much.</p>\n<p><em>Les Miserables</em> begins with the Work Song, in which the protagonist,\n<code class=\"language-plaintext highlighter-rouge\">prisoner:24601</code>, is confronted by Inspector Javert while doing some\u2026 work. 
He\ninsists he has a name, Jean Valjean, and his identifier in his\n<a href=\"https://en.wikipedia.org/wiki/Faverolles,_Aisne\">home village</a>'s fictional\ndatabase (that I just retconned) was <code class=\"language-plaintext highlighter-rouge\">faverolles:2468</code>. Javert isn't interested\nin his name. It's enough that he has a cross-reference stating that <code class=\"language-plaintext highlighter-rouge\">faverolles:2468</code>\nis equivalent to <code class=\"language-plaintext highlighter-rouge\">prisoner:24601</code>. He was only there to inform Jean Valjean that\nhis parole has begun and to issue him a <em>passeport jaune</em> (yellow ticket) for the\ncommune of <a href=\"https://en.wikipedia.org/wiki/Pontarlier\">Pontarlier</a>.</p>\n<p>I'm sure this passport also had an identifier on it. I'm going to take a bit of\ncreative freedom and say it was <code class=\"language-plaintext highlighter-rouge\">pontarlier:25791</code>. It probably also had Jean\nValjean's prisoner number on it so everybody knew he was in the 1800's fictional\nFrench convict database. The fictional 1800's French took maintaining\ncross-references very seriously.</p>\n<p>Jean Valjean never made it to Pontarlier. Instead, he broke his parole, forged\nsome new documents, and went to Montreuil-sur-Mer under the new name of Monsieur\nMadeleine. It's probably the case that his identifier for the Montreuil-sur-Mer\ncity database was <code class=\"language-plaintext highlighter-rouge\">montreuil-sur-mer:1357</code>, or something like this (more\nretcons!). It must have been a good fake, because even the king of France\nrecognized him (note: this plot point did not appear in the film).</p>\n<p>Javert figured out Jean Valjean broke his parole basically immediately and set\nout on his quest to find and capture <code class=\"language-plaintext highlighter-rouge\">prisoner:24601</code> once again. Until this\npoint, Javert has access to the prisoner registry and yellow tickets. 
He knows\n<code class=\"language-plaintext highlighter-rouge\">prisoner:24601</code> is the same as both <code class=\"language-plaintext highlighter-rouge\">faverolles:2468</code> and <code class=\"language-plaintext highlighter-rouge\">pontarlier:25791</code>.</p>\n<p>The part that will hit close to home for many bioinformaticians is that when\nJavert goes to Montreuil-sur-Mer, he meets Monsieur Madeleine. He is unaware\nthat it is Jean Valjean. There is no cross-reference from <code class=\"language-plaintext highlighter-rouge\">faverolles:2468</code>\nor <code class=\"language-plaintext highlighter-rouge\">pontarlier:25791</code> to <code class=\"language-plaintext highlighter-rouge\">montreuil-sur-mer:1357</code>. If there were a\ncross-reference in the fictional French 1800's inspector database, Javert could\nhave arrested Jean Valjean on sight. Instead, Javert had to do the hard work of\ncurating cross-references himself and finding out who was the same in the\n<code class=\"language-plaintext highlighter-rouge\">montreuil-sur-mer</code> database as <code class=\"language-plaintext highlighter-rouge\">prisoner:24601</code>. Admittedly, he probably would\nhave called this <em>inspecting</em>.</p>\n<p>The next part that will hit even closer to home for many bioinformaticians is\nthat after his inspecting, Javert actually identified the wrong guy! This led\nto one of my favorite songs in musical theater ever\n(<a href=\"https://www.youtube.com/watch?v=izuD30Cp5Ao\">Who Am I?</a>), where Monsieur\nMadeleine (<code class=\"language-plaintext highlighter-rouge\">montreuil-sur-mer:1357</code>, also actually Jean Valjean, but Javert\ndidn't yet realize this) admits that he is actually <code class=\"language-plaintext highlighter-rouge\">prisoner:24601</code>. In this\nextended metaphor of a blog post, Jean Valjean's confession in \"Who Am I\" is\neffectively the same as a database providing its own cross-references to other\ndatabases. 
Would be nice if everyone did this, and did it well, huh?</p>\n<p>You should know that Javert is a powerful cross-reference reasoning machine. He\nalready knew <code class=\"language-plaintext highlighter-rouge\">faverolles:2468</code> was the same as <code class=\"language-plaintext highlighter-rouge\">prisoner:24601</code>. Now he knew\nthat <code class=\"language-plaintext highlighter-rouge\">montreuil-sur-mer:1357</code> was the same as <code class=\"language-plaintext highlighter-rouge\">prisoner:24601</code>. This way, he\ncould infer that <code class=\"language-plaintext highlighter-rouge\">montreuil-sur-mer:1357</code> (Monsieur Madeleine) is actually\n<code class=\"language-plaintext highlighter-rouge\">faverolles:2468</code> (Jean Valjean). One of the nice properties of cross-references\nis that they're transitive through any number of connections. We'll take\nadvantage of this fact later. You'll also have to excuse the fact that\nthroughout this post, I'm operating under the assumption that \"cross-references\"\nand \"equivalences\" are the same thing. That's not always true, and sometimes it\ncan even get you in trouble. For example, provenance can be a cross-reference,\ndisease-gene associations are considered cross-references in MONDO (I\nthink), and OBO even gives specific semantics for when you should consider this\nassumption valid. We'll just have to live with it for now.</p>\n<p>Javert might have got lucky that Jean Valjean revealed himself once, but the\nshow must go on! Jean Valjean had many more songs to sing and thus had to escape\nfrom Montreuil-sur-Mer to Paris. This meant that Javert had to find <em>another</em>\nmapping to Jean Valjean's new <code class=\"language-plaintext highlighter-rouge\">paris</code> identifier. 
And we already know that the\nFrench 1800's inspector database of cross-references was not being maintained.\nExhausting!</p>\n<hr/>\n<p>In the bioinformatics community, we have a very similar problem to Inspector\nJavert. There are lots of databases that are talking about the same things, but\nonly a few of them provide mappings between each other. This means that we\neither have to curate our own cross-references, do our best to infer new\ncross-references based on ones we already have, or throw our hands in the air.</p>\n<p>Luckily, we have a few standardized resources to fall back on. In addition to\nstandardizing the storage of identifier/name pairs, the OBO format standardizes\nthe way cross-references are stored and the OBO Foundry already contains quite a\nfew cross-references imported from the ontologies that it covers.</p>\n<p>One of the most difficult entity types to map from database to database is\nphenotypes because of the variety of language used to describe each, the\ndifferences in semantics of how each is defined, and the sheer number of\ndatabases. Unfortunately, some of the most popular, like MeSH and, to an extent,\nUMLS, NCIT, SNOMED-CT, and ICD (seemingly the culprits are mostly American!?)\nprovide very little accessible information. Some are even paid, so the only\ncross-references that exist are externally curated ones from other laudable\nsources like HP, DOID, and EFO. In fact, dealing with phenotypes is such a pain\nthat there is a project called the\n<a href=\"https://monarchinitiative.org/\">Monarch Initiative</a> that has a huge staff\ntrying to solve exactly this problem and publish the results through the\n<a href=\"https://github.com/monarch-initiative/mondo\">Monarch Disease Ontology (MONDO)</a>.\nNormally, I would reference\n<a href=\"https://xkcd.com/927/\">this XKCD comic about making new standards</a> when hearing\nabout something like this. 
But these are dire times, and one of my opinions is\nthat you should always trust curators who love what they do.</p>\n<p>There are also lots of cross-references available from databases that don't\nmaintain their nomenclature as an ontology. One example is\n<a href=\"https://downloads.thebiogrid.org/File/BioGRID/Latest-Release/BIOGRID-IDENTIFIERS-LATEST.tab.zip\">BioGRID</a>,\nwhich assigns proteins internal accession numbers, but almost all of them\ncross-reference out to Entrez Gene (I counted fewer than 15 that didn't, and 3 of\nthem are COVID-related, so cut them some slack). As an aside, I don't really\nunderstand why BioGRID would go through the effort of maintaining their own\naccession numbers. In the literal handful of cases where they can't reference\nEntrez Gene, I think it would be better to email the maintainers and work with\nthem to make improvements.</p>\n<p>It's also worth noting that excellent resources like HGNC, MGI, RGD, SGD,\nEnsembl, UniProt, and others in the genome (and gene product) nomenclature do a\nstellar job at maintaining cross-references. So to all of the curators and\nmaintainers who work there, I would like to sincerely thank you.</p>\n<p>There are also community-curated cross-reference sources. One of the notable\nones is from Harvard Medical School, which maps MeSH identifiers to gene\nidentifiers in the\n<a href=\"https://raw.githubusercontent.com/indralab/gilda/master/gilda/resources/mesh_mappings.tsv\">Gilda GitHub repository</a>.\nI think this is really a good time to point out that MeSH contains a bit of\neverything, is ubiquitous throughout the bioinformatics community, and in my\nopinion is doing a huge disservice by not providing these kinds of mappings\nitself. Or, alternatively, it is, and both the Harvard guys and I have never\nfound it. It's not impossible, but we're all very motivated, so I think we would\nhave found it if it did. 
If any MeSH maintainers are reading this and want help\nmaking this happen, I would be elated to donate my time to you to help solve\nthis problem.</p>\n<p>With all these data sources in mind, I built an extensible pipeline in\n<a href=\"https://github.com/pyobo/pyobo/blob/master/src/pyobo/xrefdb/xrefs_pipeline.py\">PyOBO</a>\nfor extracting cross-references from entries in OBO Foundry and other\ncross-reference sources. Throughout the process, I realized that these sources\nhave an incredible variety in how they name prefixes and how the OBO format\nitself has been (ab)used. I made lots of improvements, wrote extensible code\nthat allowed the specification of new rules through external files (and thus\nless code writing in the future), and did lots more curation. I won't get into\nthe technical part of that here, since you can read the code (if you dare).</p>\n<p>After all this coding, I wrote a script (just run <code class=\"language-plaintext highlighter-rouge\">obo javerts-xrefs</code>) that\ntakes all available cross-references, normalizes their namespaces, normalizes\ntheir identifiers, and dumps them in a big ol' TSV file. 5 columns - source\nnamespace, source identifier, target namespace, target identifier, and\nprovenance (ontology name or URL). No nonsense. Get it at\n<a href=\"https://doi.org/10.5281/zenodo.3757266\"><img alt=\"DOI\" src=\"https://zenodo.org/badge/DOI/10.5281/zenodo.3757266.svg\"/></a>.\nI'll make updates periodically as I add new sources.</p>\n<hr/>\n<p>Once you have a database of cross-references, you have actually built an\nundirected graph. Equivalences go both ways, and they are transitive. This means\nthat every connected component in an equivalence graph represents a set of\nentities that are mutually equivalent. 
In other words, if a path exists between\ntwo nodes in an equivalence graph, then they are equivalent.</p>\n<p>Even better, you don't have to materialize all of the possible inferred\nequivalences when you have an equivalence graph because identifying all of the\nnodes in a connected component can be done in linear time with respect to the\nsize of the connected component, which is usually pretty small, by using a\nbreadth-first or depth-first search.</p>\n<p>Based on that, one application of an equivalence graph is to identify all of\nthe nodes that are equivalent to a given node. You can also get a little trickier\nand identify the paths through which the traversal must go if you want to\nestablish an equivalency. You could even go further and weight edges based on\nhow much you trust the source from which they came to identify how much you\nshould believe in a mapping. For example, if you have a percent confidence in\neach mapping being right, then the confidence in the whole pathway would be the\nproduct of the confidences.</p>\n<p>The actual problem I set out to solve was this: given a set of entities, remap all of\nthem based on a prioritized list. For example, I might have a set of entities\nthat contains HGNC genes, Entrez Genes, and OMIM genes. If my favorite\nnomenclature consortium is Entrez, my second favorite is HGNC, and my third\nfavorite is OMIM and I have an equivalence database, I might want to remap all\nof my identifiers to the highest-priority namespace available for each entity. This is very important during the curation of mechanistic\nbiology (such as with BEL), since curators will likely use all sorts of\nidentifiers with no clear guidelines or rules. This means that the same entity\nmight appear twice with different identifiers in the same curated data!</p>\n<p>Given a priority list, you can even transform an equivalence graph into a\ndirected graph where each identifier has a single out edge pointing towards the\nidentifier that is the best mapping. 
Then, each connected component would become\na star graph. There's actually a better data structure for this - a mapping\n(i.e., a dictionary) - since each entity points to exactly one thing. This is a more efficient data\nstructure for storage, and if your graph is implemented as an adjacency\ndictionary (because you're using <code class=\"language-plaintext highlighter-rouge\">networkx</code>, right?), then you basically already\nhave this.</p>\n<p>I've provided an implementation for all of these in PyOBO. They can be run as a\nweb API with <code class=\"language-plaintext highlighter-rouge\">python -m pyobo.apps.mapper</code>. There's a command line option to allow\nyou to load the TSV from Inspector Javert's Xref Database directly, or if you're\nfeeling lucky, to regenerate it yourself. Below I will give a few examples of\nhow to use it. Later, I would also like to host this service for anyone to use.</p>\n<ol>\n<li>Install PyOBO with <code class=\"language-plaintext highlighter-rouge\">pip install git+https://github.com/pyobo/pyobo.git</code></li>\n<li>Download Inspector Javert's Xref Database from Zenodo, unpack it, and find\nthe xrefs file.</li>\n<li>Run the web service with\n<code class=\"language-plaintext highlighter-rouge\">python -m pyobo.apps.mapper -x inspector_javerts_xrefs.tsv.gz</code></li>\n<li>Use the following code to figure stuff out!</li>\n</ol>\n<div class=\"language-python highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"kn\">import</span> <span class=\"nn\">requests</span>\n\n<span class=\"c1\"># Get all entities mapped to MAPT, including through chains of xrefs\n</span><span class=\"n\">successful_request</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/mappings/hgnc:6893'</span><span class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span 
class=\"s\">\"\"\"\n{\n    \"orphanet:123144\": [\n        {\n            \"provenance\": \"hgnc\",\n            \"source\": \"hgnc:6893\",\n            \"target\": \"orphanet:123144\"\n        }\n    ],\n    \"pr:P10636\": [\n        {\n            \"provenance\": \"hgnc\",\n            \"source\": \"hgnc:6893\",\n            \"target\": \"uniprot:P10636\"\n        },\n        {\n            \"provenance\": \"pr\",\n            \"source\": \"uniprot:P10636\",\n            \"target\": \"pr:P10636\"\n        }\n    ],\n    ...\n}\n\"\"\"</span>\n\n<span class=\"c1\"># Keep in mind this isn't a validation service\n</span><span class=\"n\">unsuccessful_request</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/mappings/hgnc:0000'</span><span class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span class=\"c1\"># {\"message\": \"could not find curie\", \"query\": {\"curie\": \"hgnc:0000\"}, \"success\": False}\n</span>\n<span class=\"c1\"># Get all paths mapping MAPT in HGNC to Ensembl. 
Returns a list of paths (which are lists of xrefs)\n</span><span class=\"n\">path_request</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/mappings/hgnc:6893/ensembl:ENSG00000186868'</span><span class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span class=\"s\">\"\"\"\n[\n    [\n        {\n            \"provenance\": \"hgnc\",\n            \"source\": \"hgnc:6893\",\n            \"target\": \"ensembl:ENSG00000186868\"\n        }\n    ]\n]\n\"\"\"</span>\n\n<span class=\"c1\"># Get the priority identifier for MAPT identified by COSMIC\n</span><span class=\"n\">prioritize_request</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/prioritize/cosmic:MAPT'</span><span class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span class=\"c1\"># {\"found\": True, \"query\": \"cosmic:MAPT\", \"result\": \"hgnc:6893\"}\n</span>\n<span class=\"c1\"># What happens when a CURIE can't be found for prioritization\n</span><span class=\"n\">unsuccessful_prioritize_request</span> <span class=\"o\">=</span> <span class=\"n\">requests</span><span class=\"p\">.</span><span class=\"n\">get</span><span class=\"p\">(</span><span class=\"s\">'http://localhost:5000/prioritize/cosmic:NOPE'</span><span class=\"p\">).</span><span class=\"n\">json</span><span class=\"p\">()</span>\n<span class=\"c1\"># {\"found\": False, \"query\": \"cosmic:NOPE\"}\n</span></code></pre></div></div>\n<p>I'd like to give a big thanks to my high school music teacher, Ken Tedeschi, for\nhelping me (and basically everyone else) fall in love with Les Mis in high\nschool. Writing about my work was so much more fun in extended metaphor. I would\nalso like to thank Hugh Jackman. 
You know, for being Hugh Jackman.</p>\n<hr/>\n<p>I have some random afterthoughts that I think might be worth including, which I'm\nadding after originally posting this.</p>\n<p>You might be wondering why I didn't get into a discussion about the\n<a href=\"https://www.ebi.ac.uk/about/news/announcement/industry-collaboration-ontology-mapping-service\">Ontology Mapping Service (OXO)</a>\nfrom the EBI. It looks to me like this project has been abandoned. Even if not,\nits API has most of the same issues that I described in a <a href=\"https://cthoyt.com/2020/04/18/ooh-na-na.html\">previous\npost</a>.</p>\n<p>I'm also aware of <a href=\"https://bridgedb.github.io\">BridgeDB</a>, from which I think I\nwill be able to take inspiration to include more xrefs later. However, I think\nthey're limited in scope, and PyOBO is more about standardizing data so nobody\nhas to figure out databases\u2026 again and again and again.</p>\n<p>One glaring omission from this work is WikiData mappings. I have a plan to\ninclude curated information in the PyOBO metaregistry that links databases to\ntheir WikiData properties. That will allow me to build an automated framework\nfor downloading these mappings, given the curation of the properties.</p>","doi":"https://doi.org/10.59350/r3qzt-z0d08","guid":"https://cthoyt.com/2020/04/19/inspector-javerts-xref-database","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1587254400,"rid":"vqbr3-bcb03","summary":"On top of the issue of resolving identifiers to their names, the bioinformatics community has a hard time figuring out when two identifiers from different databases are equivalent. You know who else has the same problem? Inspector Javert. 
Get ready for a Les Miserables-themed post on how to address this long-standing problem.","tags":["Mappings"],"title":"Inspector Javert's Xref Database","updated_at":1781539905,"url":"https://cthoyt.com/2020/04/19/inspector-javerts-xref-database.html","version":"v1"},{"authors":[{"contributor_roles":[],"family":"Tapley Hoyt","given":"Charles","url":"https://orcid.org/0000-0003-4423-4370"}],"blog":{"authors":null,"community_id":"da4ef2af-5fad-46b2-8195-d77db0141ad6","created":1716422400,"current_feed_url":null,"description":"Unraveling complex biology with biological knowledge graphs. Content licensed under CC BY 4.0.","favicon":"https://rogue-scholar.org/api/communities/da4ef2af-5fad-46b2-8195-d77db0141ad6/logo","feed_format":"application/atom+xml","feed_url":"https://cthoyt.com/feed.xml","filter":null,"generator":"Jekyll","home_page_url":"https://cthoyt.com/","issn":null,"language":"eng","license":"https://creativecommons.org/licenses/by/4.0/legalcode","prefix":"10.59350","relative_url":null,"secure":true,"slug":"cthoyt","status":"active","subfield":"1312","title":"Biopragmatics","updated":1781539024,"use_api":null},"blog_name":"Biopragmatics","blog_slug":"cthoyt","content_html":"<p>As scientists, we place huge importance on the communication of our results. We\nspend lots of time on editing, revising, and formatting so people can understand\nwhat we did. We also write a lot of code, so why aren't we investing the same\namount of love? Enter, <a href=\"https://flake8.pycqa.org/en/latest/\">flake8</a>.</p>\n<p>It's incredibly important that we write following community standards so when\nother people read our work, they don't have to think about how it's organized.\nFor scientific prose, this usually means the IMRD\n(introduction-methods-results-discussion) format. 
In Python, my current favorite\nprogramming language for science, this means using a standardized number of\nspaces for indents (4), using triple-double quotes for docstrings at the\nbeginning of each module, class, and function, and lots more.</p>\n<p>It's pretty intimidating to figure out style. For English prose, Strunk and\nWhite wrote\n<a href=\"http://www.jlakes.org/ch/web/The-elements-of-style.pdf\"><em>The Elements of Style</em></a>.\nFor Python, Guido van Rossum wrote\n<a href=\"https://www.python.org/dev/peps/pep-0008/\">PEP-8</a> and Raymond Hettinger\npresented <a href=\"https://www.youtube.com/watch?v=wf-BqAjZb8M\">Beyond PEP-8</a>. Even with\nthese resources, it's still hard to learn which are rules and which\n<a href=\"https://www.youtube.com/watch?v=k9ojK9Q_ARE\">are more like guidelines</a>.</p>\n<p>This post is a short explanation of how I use <code class=\"language-plaintext highlighter-rouge\">flake8</code> to keep a consistent\nstyle in the code in my Python projects. There's a similar command line tool for\nfixing the style in R projects that's already built into most operating\nsystems - <code class=\"language-plaintext highlighter-rouge\">rm -rf *</code>, but I won't get more into that here.</p>\n<p>It's pretty easy to get up and running with <code class=\"language-plaintext highlighter-rouge\">flake8</code> - just run\n<code class=\"language-plaintext highlighter-rouge\">pip install flake8</code> then use it from the shell on a Python file like\n<code class=\"language-plaintext highlighter-rouge\">flake8 my_file.py</code> or <code class=\"language-plaintext highlighter-rouge\">flake8 my_directory/</code>. Then, it outputs a list of\nproblems that need to be fixed on a line-by-line basis in your code.</p>\n<p><img alt=\"Flake8 Feedback\" src=\"https://cthoyt.com/img/flake8_output.png\"/></p>\n<p>You can also install plugins with <code class=\"language-plaintext highlighter-rouge\">pip</code> that extend the kinds of things it\nchecks. 
A few that I install are:</p>\n<ul>\n<li><a href=\"https://github.com/gforcada/flake8-builtins\">flake8-builtins</a> - make sure you\ndon't accidentally name a variable the same thing as a builtin. This happens a\nlot with <code class=\"language-plaintext highlighter-rouge\">id</code>.</li>\n<li><a href=\"https://github.com/PyCQA/flake8-bugbear\">flake8-bugbear</a> - \"find likely bugs\nand design problems in your program\", like when you have an unused variable in\na loop</li>\n<li><a href=\"https://github.com/and3rson/flake8-colors\">flake8-colors</a> - add color to the\n<code class=\"language-plaintext highlighter-rouge\">flake8</code> output (explanation how to set up is below)</li>\n<li><a href=\"https://github.com/PyCQA/flake8-commas\">flake8-commas</a> - add trailing commas\nwhere appropriate</li>\n<li><a href=\"https://github.com/adamchainz/flake8-comprehensions\">flake8-comprehensions</a>\nreminders to use list comprehensions where appropriate</li>\n<li><a href=\"https://github.com/PyCQA/flake8-docstrings\">flake8-docstrings</a> - make sure\nyour docstrings are present and written in the right format</li>\n<li><a href=\"https://github.com/PyCQA/flake8-import-order\">flake8-import-order</a> - make\nsure your imports are organized properly</li>\n<li><a href=\"https://github.com/JBKahn/flake8-print\">flake8-print</a> - make sure you never\never ever use <code class=\"language-plaintext highlighter-rouge\">print()</code>. 
The only exception is when using print to get\ntext into a file with <code class=\"language-plaintext highlighter-rouge\">print(..., file=...)</code></li>\n<li><a href=\"https://github.com/MichaelKim0407/flake8-use-fstring\">flake8-use-fstring</a> -\nmake sure you're using f-strings instead of <code class=\"language-plaintext highlighter-rouge\">%</code> or <code class=\"language-plaintext highlighter-rouge\">.format()</code> formatting.\nThe exception is logging.</li>\n<li><a href=\"https://github.com/PyCQA/pep8-naming\">pep8-naming</a> - make sure names of\nvariables, classes, and modules look right.</li>\n<li><a href=\"https://github.com/PyCQA/pydocstyle/\">pydocstyle</a> - docstring style checker</li>\n</ul>\n<p>In each of my repositories, I put all of the information on how to install\n<code class=\"language-plaintext highlighter-rouge\">flake8</code> and its plugins then run them in a <code class=\"language-plaintext highlighter-rouge\">tox</code> configuration under the\n<code class=\"language-plaintext highlighter-rouge\">[testenv:flake8]</code> header so they can be run easily and reproducibly with\n<code class=\"language-plaintext highlighter-rouge\">tox -e flake8</code>. 
An example of part of one of my <code class=\"language-plaintext highlighter-rouge\">tox.ini</code> files (which always\nlives in the root of the repository) is below:</p>\n<div class=\"language-ini highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"nn\">[testenv:flake8]</span>\n<span class=\"py\">skip_install</span> <span class=\"p\">=</span> <span class=\"s\">true</span>\n<span class=\"py\">deps</span> <span class=\"p\">=</span>\n    <span class=\"err\">flake8</span>\n    <span class=\"err\">flake8-bandit</span>\n    <span class=\"err\">flake8-builtins</span>\n    <span class=\"err\">flake8-bugbear</span>\n    <span class=\"err\">flake8-colors</span>\n    <span class=\"err\">flake8-commas</span>\n    <span class=\"err\">flake8-comprehensions</span>\n    <span class=\"err\">flake8-docstrings</span>\n    <span class=\"err\">flake8-import-order</span>\n    <span class=\"err\">flake8-print</span>\n    <span class=\"err\">flake8-use-fstring</span>\n    <span class=\"err\">pep8-naming</span>\n    <span class=\"err\">pydocstyle</span>\n<span class=\"py\">commands</span> <span class=\"p\">=</span>\n    <span class=\"err\">flake8</span> <span class=\"err\">src/pybel/</span> <span class=\"err\">tests/</span> <span class=\"err\">setup.py</span>\n<span class=\"py\">description</span> <span class=\"p\">=</span> <span class=\"s\">Run the flake8 tool with several plugins (bandit, docstrings, import order, pep8 naming).</span>\n</code></pre></div></div>\n<p>Another configuration file you can set up in the root of the repository is\n<code class=\"language-plaintext highlighter-rouge\">.flake8</code>. 
Unfortunately, the Python configuration file reader doesn't allow\nsome of the crazy characters that I want to use for the colors so this can't be\nincluded in <code class=\"language-plaintext highlighter-rouge\">setup.cfg</code> or <code class=\"language-plaintext highlighter-rouge\">tox.ini</code> like most of your other configuration.</p>\n<div class=\"language-ini highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"nn\">[flake8]</span>\n<span class=\"py\">ignore</span> <span class=\"p\">=</span>\n    <span class=\"c\"># line break before binary operator\n</span>    <span class=\"err\">W503</span>\n<span class=\"py\">exclude</span> <span class=\"p\">=</span>\n    <span class=\"err\">.tox,</span>\n    <span class=\"err\">.git,</span>\n    <span class=\"err\">__pycache__,</span>\n    <span class=\"err\">docs/source/conf.py,</span>\n    <span class=\"err\">build,</span>\n    <span class=\"err\">dist,</span>\n    <span class=\"err\">tests/fixtures/*,</span>\n    <span class=\"err\">*.pyc,</span>\n    <span class=\"err\">*.egg-info,</span>\n    <span class=\"err\">.cache,</span>\n    <span class=\"err\">.eggs</span>\n<span class=\"py\">max-line-length</span> <span class=\"p\">=</span> <span class=\"s\">120</span>\n<span class=\"py\">import-order-style</span> <span class=\"p\">=</span> <span class=\"s\">pycharm</span>\n<span class=\"py\">application-import-names</span> <span class=\"p\">=</span>\n    <span class=\"err\">pybel</span>\n    <span class=\"err\">bel_resources</span>\n    <span class=\"err\">tests</span>\n<span class=\"py\">format</span> <span class=\"p\">=</span> <span class=\"s\">${cyan}%(path)s${reset}:${yellow_bold}%(row)d${reset}:${green_bold}%(col)d${reset}: ${red_bold}%(code)s${reset} %(text)s</span>\n</code></pre></div></div>\n<p>First thing you'll notice is the <code class=\"language-plaintext highlighter-rouge\">ignore</code> list. 
This isn't here to turn <code class=\"language-plaintext highlighter-rouge\">flake8</code>\noff because you're feeling lazy. If somebody includes a change to this list in\ntheir PR, you have to explain to them that compliance is not optional, then help\nthem work through the problem that they obviously gave up on solving. It's\nactually there for you, as the project maintainer, to enumerate the <code class=\"language-plaintext highlighter-rouge\">flake8</code>\nrules that you don't agree with. For example, I totally disagree with the <code class=\"language-plaintext highlighter-rouge\">W503</code>\nline break before binary operator rule. I want to write long conditionals with each <code class=\"language-plaintext highlighter-rouge\">and</code> at the\nstart of its line, like this:</p>\n<div class=\"language-python highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"k\">if</span> <span class=\"p\">(</span>\n    <span class=\"n\">condition_1</span>\n    <span class=\"ow\">and</span> <span class=\"n\">condition_2</span>\n    <span class=\"ow\">and</span> <span class=\"n\">condition_3</span>\n<span class=\"p\">):</span>\n    <span class=\"k\">print</span><span class=\"p\">(</span><span class=\"s\">'all true'</span><span class=\"p\">)</span>\n</code></pre></div></div>\n<p>One of the benefits of this style is you can add more lines with only single\nline diffs. The other is that the reader always sees the operation that goes\nwith each line. Same could be done with arithmetic that could incorporate not\nonly <code class=\"language-plaintext highlighter-rouge\">+</code> but also <code class=\"language-plaintext highlighter-rouge\">-</code>.</p>\n<p>Next is the <code class=\"language-plaintext highlighter-rouge\">exclude</code> block. Just copy/paste this each time, since it has lots\nof garbage you don't want <code class=\"language-plaintext highlighter-rouge\">flake8</code> to bother with. 
One of the checkers in\n<code class=\"language-plaintext highlighter-rouge\">flake8</code> measures the \"cyclomatic\" complexity of functions. You can raise the maximum\nallowed value with <code class=\"language-plaintext highlighter-rouge\">max-complexity</code>. Normally, you want this to be enforced, but\nsometimes there's no way around a complex function. For this, you can add a code\ncomment <code class=\"language-plaintext highlighter-rouge\">noqa</code> followed by the error code like <code class=\"language-plaintext highlighter-rouge\"># noqa:W123</code>. Again, adding tags\nto ignore bad style just to pass <code class=\"language-plaintext highlighter-rouge\">flake8</code> defeats the point.</p>\n<p>The <code class=\"language-plaintext highlighter-rouge\">max-line-length</code> is a very contentious setting. I think 120 is fine. Some\npeople think 78, 79, or 80 is best because of the standard sizes of old computer\nscreens or punch cards\u2026 When I get older and I can't read my computer screen,\nI'll probably make the text bigger and change my mind about this. If you find\nyourself breaking up lines in a totally nonsensical, unstyled way, then you're\nconforming too tightly to the rules. Sorry about the mixed messages!</p>\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>import-order-style = pycharm\napplication-import-names =\n    pybel\n    bel_resources\n    tests\n</code></pre></div></div>\n<p>I copied this again because this part is really important. You have to tell\n<code class=\"language-plaintext highlighter-rouge\">flake8</code> what rules you use for import order. I use the PyCharm rules, which\ngroup Python built-in packages, then third-party packages, then my packages. 
The\n<code class=\"language-plaintext highlighter-rouge\">application-import-names</code> setting is where you list your own packages.</p>\n<p>Last is the <code class=\"language-plaintext highlighter-rouge\">format</code> entry, which gives the nice colorful output. Copy/paste\nthis! I borrowed mine from <a href=\"https://github.com/scolby33\">Scott Colby</a>.</p>\n<hr/>\n<p>After all of that, I set up Travis CI to run <code class=\"language-plaintext highlighter-rouge\">tox</code> every time code is pushed to\nthe repository. If you're working in a team, you probably do something like the\nfork/pull request or branch/pull request workflow on GitHub to support doing\ncode review before merging new code. The best part is that there's a big box on\neach pull request that shows whether <code class=\"language-plaintext highlighter-rouge\">flake8</code> passed (among other tests), which\nmeans that there were no errors detected.</p>\n<p>I encourage my teammates to make pull requests as soon as they start working on\ncode. GitHub even has a \"draft pull request\" mode now. However, before asking\nanyone to review your code, it has to pass <code class=\"language-plaintext highlighter-rouge\">flake8</code>. And obviously, no code that\nisn't passing flake8 can be merged.</p>\n<p>This is a <em>very</em> painful process to get people used to. I've done it with many\ngroups of people and always got pushback. However, everyone who has gone through\nthe process with me has come out the other side happy that they did it. It's\nimportant that when you start enforcing coding rules on other people, you\nare a resource for them - when somebody is frustrated by a flake8 error code\nthey have never seen, they will likely forget how to use Google. They will\nprobably ask you for help. You have to resist the urge to send\n<a href=\"https://lmgtfy.com\">lmgtfy</a> links to them and be patient. 
Because eventually,\nthey will do it on their own, and spread the gospel of <code class=\"language-plaintext highlighter-rouge\">flake8</code>.</p>\n<p>While a good arsenal of <code class=\"language-plaintext highlighter-rouge\">flake8</code> plugins provides a solid foundation, it's not\nall that needs to be done to make your code readable and look good. Just like\nwith reading and speaking, the best way to develop a sense of style is by\nreading <em>lots</em> of code (with the caveat that reading poorly written code\nprobably won't teach you much). Within the rules imposed by <code class=\"language-plaintext highlighter-rouge\">flake8</code>, there is\nlots of space for style. If you watch lectures from David Beazley, you'll notice\na very different style from Raymond Hettinger, and also from me.</p>\n<p>Now that you've made it to the end of this short guide, I wish you the best of\nluck in developing your own style!</p>\n<hr/>\n<p>Are you working with people who are particularly unsusceptible to Travis CI\nemails or checking the big red box on pull requests? You could try getting them\nset up with <a href=\"https://pre-commit.com/\">pre-commit hooks</a>, which run the style\nchecks locally any time they try to commit (even if it's to a branch) and\ngive them the message in the console.</p>\n<p>Is style not your thing at all / you're not ready to let go of your identity as\na Java/Perl developer? Maybe consider <a href=\"https://github.com/psf/black\">Black</a>,\nwhich actually re-writes your code in a deterministic style. 
I don't live by it,\nbut it's a great tool to run on a code base that's never been loved before going\nback and stylizing it.</p>","doi":"https://doi.org/10.59350/cfxtw-0ma23","guid":"https://cthoyt.com/2020/04/25/how-to-code-with-me-flake8","language":"en","license":"https://creativecommons.org/licenses/by/4.0/legalcode","published_at":1587772800,"rid":"4y55t-c1k96","summary":"As scientists, we place huge importance on the communication of our results. We spend lots of time on editing, revising, and formatting so people can understand what we did. We also write a lot of code, so why aren't we investing the same amount of love? Enter, flake8.","tags":["Code With Me"],"title":"How to Code with Me - Flake8 Hell","updated_at":1781539903,"url":"https://cthoyt.com/2020/04/25/how-to-code-with-me-flake8.html","version":"v1"}],"out_of":50584,"page":1,"per_page":10,"total-results":50584}
