<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Data 'n' Coffee]]></title><description><![CDATA[Database meets coffee mug and starts a family.]]></description><link>https://www.datancoff.ee</link><image><url>https://substackcdn.com/image/fetch/$s_!dhJG!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb008d0-9f51-45d9-a807-90edecf46ffd_700x700.png</url><title>Data &apos;n&apos; Coffee</title><link>https://www.datancoff.ee</link></image><generator>Substack</generator><lastBuildDate>Mon, 06 Apr 2026 19:36:51 GMT</lastBuildDate><atom:link href="https://www.datancoff.ee/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Serhii Sokolenko]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[datancoffee@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[datancoffee@substack.com]]></itunes:email><itunes:name><![CDATA[Serhii Sokolenko]]></itunes:name></itunes:owner><itunes:author><![CDATA[Serhii Sokolenko]]></itunes:author><googleplay:owner><![CDATA[datancoffee@substack.com]]></googleplay:owner><googleplay:email><![CDATA[datancoffee@substack.com]]></googleplay:email><googleplay:author><![CDATA[Serhii Sokolenko]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Scandal at Kaggle]]></title><description><![CDATA[As I write these words, I marvel at my silver medal from the 2024 Automated Essay Scoring competition on Kaggle.]]></description><link>https://www.datancoff.ee/p/the-scandal-at-kaggle</link><guid isPermaLink="false">https://www.datancoff.ee/p/the-scandal-at-kaggle</guid><dc:creator><![CDATA[Serhii 
Sokolenko]]></dc:creator><pubDate>Mon, 15 Jul 2024 15:25:58 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/45071c5d-74d8-40df-af3d-c5b11c51ff2f_357x337.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As I write these words, I marvel at <a href="https://www.kaggle.com/certification/competitions/datancoffee/learning-agency-lab-automated-essay-scoring-2">my silver medal</a> from the <a href="https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2">2024 Automated Essay Scoring</a> competition on Kaggle. This competition will go down in the annals of Data Science competitions as the equivalent of the French parliamentary election of 2024.</p><p>When the final rankings were published, the positions of the vast majority of participants were completely upended. Most of the top 20 participants dropped hundreds of positions, while many mid-tier participants surged to the top of the rankings, myself included (I ended up placing 110th out of 2700 teams). 
My silver medal is mostly undeserved, but there are valuable lessons in what happened, applicable to real-life ML tasks, and that&#8217;s what this blog is about.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FwQe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3601f127-7ee2-44fb-9bcd-954f85d88fe5_357x337.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FwQe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3601f127-7ee2-44fb-9bcd-954f85d88fe5_357x337.png 424w, https://substackcdn.com/image/fetch/$s_!FwQe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3601f127-7ee2-44fb-9bcd-954f85d88fe5_357x337.png 848w, https://substackcdn.com/image/fetch/$s_!FwQe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3601f127-7ee2-44fb-9bcd-954f85d88fe5_357x337.png 1272w, https://substackcdn.com/image/fetch/$s_!FwQe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3601f127-7ee2-44fb-9bcd-954f85d88fe5_357x337.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FwQe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3601f127-7ee2-44fb-9bcd-954f85d88fe5_357x337.png" width="357" height="337" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3601f127-7ee2-44fb-9bcd-954f85d88fe5_357x337.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:337,&quot;width&quot;:357,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!FwQe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3601f127-7ee2-44fb-9bcd-954f85d88fe5_357x337.png 424w, https://substackcdn.com/image/fetch/$s_!FwQe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3601f127-7ee2-44fb-9bcd-954f85d88fe5_357x337.png 848w, https://substackcdn.com/image/fetch/$s_!FwQe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3601f127-7ee2-44fb-9bcd-954f85d88fe5_357x337.png 1272w, https://substackcdn.com/image/fetch/$s_!FwQe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3601f127-7ee2-44fb-9bcd-954f85d88fe5_357x337.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>Silver Medal <a href="https://www.kaggle.com/certification/competitions/datancoffee/learning-agency-lab-automated-essay-scoring-2">placement</a> in the AES 2.0 Kaggle Competition</p><p>The competition&#8217;s goal was to develop a system to grade English-language essays on a scale from 1 to 6. The competition&#8217;s sponsors (the Learning Agency Lab at Vanderbilt University, the Bill &amp; Melinda Gates Foundation, Schmidt Futures, and the Chan Zuckerberg Initiative) were looking for an automated grading system that would reduce the expense and time required to hand-grade these essays. It would also allow students to receive regular and timely feedback on their writing. Imagine a system that works like Grammarly and can improve your emails, Slack messages, and, yes, this blog, available for free to all.</p><p>Sponsored to the tune of $50,000 in prizes, the competition attracted ~3300 participants organized into ~2700 teams. 
As one of these participants, I had the chance to observe how the community collaborated to exchange ideas and educate first-timers. As is customary in Kaggle competitions, after the final rankings were announced, the top teams published their analyses of what worked and what did not. I used these write-ups, together with my own observations, to come up with 5 main lessons from this dramatic competition.</p><p>But before I share these 5 lessons, let me explain a few important things.</p><p><strong>Public Training Data / Private Test Data:</strong> In this competition, the organizers publicly shared a training dataset of about 17,000 English-language essays, but kept their test dataset (rumored to be around 8,000 essays) private. Furthermore, they divided this <em>private</em> test dataset into two parts:</p><ul><li><p><strong>1/3rd</strong> was used for continuously scoring submissions. Every participant was allowed up to 5 submissions per day, and many made hundreds of submissions over the course of the competition (I made about 70). While the actual test essays were never published, the scores of submissions based on these test essays <em>were</em> published on the <a href="https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/leaderboard?tab=public">Public Leaderboard</a>. Participants used these scores to evaluate whether they were on the right path and making improvements.</p></li><li><p><strong>2/3rds</strong> of the test data was held back until the very end, to run the final scoring and ranking on the <a href="https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/leaderboard?">Private Leaderboard</a>. While the smaller test dataset (the first 1/3rd) was indirectly accessible to users via so-called &#8220;probing&#8221;, the larger (truly-private) test dataset was a complete mystery until the very end. 
By the way, this larger test dataset was the one that completely upended the rankings, in a phenomenon known on Kaggle as a &#8220;shake-up&#8221;.</p></li></ul><p><strong>Submission Scores:</strong> Each submission was evaluated on a metric called Quadratic Weighted Kappa (QWK). This metric compares the predicted labels with the true labels and calculates the level of agreement between the two (how well the predicted labels match the true labels). It is similar to a simple percent agreement calculation, but it corrects for the predictions and true values being the same by chance. Because the weights are quadratic, a prediction that misses by several grade points is penalized much more heavily than one that is off by a single point.</p><p>The important thing to remember here is that the maximum score a submission could have achieved is 1.0. Of course, no submission was as good as 1.0. The winning submission had a QWK score of 0.84128, the cut-off for Gold Medals was 0.83583, and I had a score of 0.83523 (yes, only 0.0006 short of Gold, not that I would have deserved it).</p><p>Now, let&#8217;s get back to our top 5 learnings.</p><p><strong>1. The Importance of Good Data and Data Provenance</strong></p><p>The best teams realized (through Exploratory Data Analysis) that the training data was sourced from two places: dataset A, about 75% of the training data, came from <a href="https://www.kaggle.com/datasets/nbroad/persaude-corpus-2">Persuade Corpus 2.0</a>, and dataset B, the other 25%, appeared to be entirely new and never used before this competition. 
The #4 participant in this competition, Tascj, <a href="https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/discussion/516639">theorized</a> that &#8220;the scores from the two data sources are collected in different ways, thus not compatible&#8221;.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3HjT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c882488-3cf0-4c8e-8d34-310281de525f_1024x572.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3HjT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c882488-3cf0-4c8e-8d34-310281de525f_1024x572.png 424w, https://substackcdn.com/image/fetch/$s_!3HjT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c882488-3cf0-4c8e-8d34-310281de525f_1024x572.png 848w, https://substackcdn.com/image/fetch/$s_!3HjT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c882488-3cf0-4c8e-8d34-310281de525f_1024x572.png 1272w, https://substackcdn.com/image/fetch/$s_!3HjT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c882488-3cf0-4c8e-8d34-310281de525f_1024x572.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3HjT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c882488-3cf0-4c8e-8d34-310281de525f_1024x572.png" width="1024" height="572" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c882488-3cf0-4c8e-8d34-310281de525f_1024x572.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:572,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!3HjT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c882488-3cf0-4c8e-8d34-310281de525f_1024x572.png 424w, https://substackcdn.com/image/fetch/$s_!3HjT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c882488-3cf0-4c8e-8d34-310281de525f_1024x572.png 848w, https://substackcdn.com/image/fetch/$s_!3HjT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c882488-3cf0-4c8e-8d34-310281de525f_1024x572.png 1272w, https://substackcdn.com/image/fetch/$s_!3HjT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c882488-3cf0-4c8e-8d34-310281de525f_1024x572.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Another big finding was that the &#8220;truly-private&#8221; test dataset used for final rankings was more similar to the non-Persuade dataset B. Effectively, our training dataset had some data that looked like the test data, and some data that was different from it. The <a href="https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/discussion/516791">1st-ranked</a> submission even &#8220;fixed&#8221; the scores of the Persuade-based dataset A using a technique called &#8220;pseudo labeling&#8221;. 
Many top submissions (<a href="https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/discussion/516791">1st</a>, <a href="https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/discussion/516790">2nd</a>, <a href="https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/discussion/517014">3rd</a>, <a href="https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/discussion/516922">5th</a>) ended up dividing their training into two steps: pre-training on the Persuade dataset A, then fine-tuning on dataset B. Other submissions used the source of the data as a predictive feature (<a href="https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/discussion/516639">4th</a>).</p><p><em>Insight: If you work in a corporate setting, you will most probably recognize the problem and the possible solutions. Knowing where the data came from, for example by having a catalog of data sources with metadata on their creators, would save you several months of EDA and guessing. In this competition, this information made the difference between being in the top 5% and the top 0.5%.</em></p><p><strong>2. Knowing When to Stop Prevented Overfitting</strong></p><p>Data exploration led to the discovery of the two subsets of data, but this was not the only finding. It turned out that the Persuade subset consisted of essays on 7 topics, such as &#8220;Car-free cities&#8221; and &#8220;Does the electoral college work?&#8221;. Many top-performing participants spent the final month of the competition trying to determine the topic of each essay from its text and to use it as a predictive feature. They probed the private test dataset with dozens of submissions, trying to determine whether it contained any new topics.</p><p>As it turned out, this was not the case, and the private test data was not from the Persuade dataset, but the damage was done. 
A lot of participants over-optimized their models for the Persuade data by clustering the essays in groups that represented those 7 essay topics. When their models were applied to the &#8220;truly-private&#8221; test dataset, they underperformed and teams were &#8220;shaken-down&#8221; by hundreds of positions on the Private Leaderboard.</p><p><em><strong>Former mid-tier-ranked participants surging to the top&nbsp;</strong></em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hDxB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15906032-647f-489d-ae66-e50f164e81b1_1024x461.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hDxB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15906032-647f-489d-ae66-e50f164e81b1_1024x461.png 424w, https://substackcdn.com/image/fetch/$s_!hDxB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15906032-647f-489d-ae66-e50f164e81b1_1024x461.png 848w, https://substackcdn.com/image/fetch/$s_!hDxB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15906032-647f-489d-ae66-e50f164e81b1_1024x461.png 1272w, https://substackcdn.com/image/fetch/$s_!hDxB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15906032-647f-489d-ae66-e50f164e81b1_1024x461.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!hDxB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15906032-647f-489d-ae66-e50f164e81b1_1024x461.png" width="1024" height="461" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15906032-647f-489d-ae66-e50f164e81b1_1024x461.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:461,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!hDxB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15906032-647f-489d-ae66-e50f164e81b1_1024x461.png 424w, https://substackcdn.com/image/fetch/$s_!hDxB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15906032-647f-489d-ae66-e50f164e81b1_1024x461.png 848w, https://substackcdn.com/image/fetch/$s_!hDxB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15906032-647f-489d-ae66-e50f164e81b1_1024x461.png 1272w, https://substackcdn.com/image/fetch/$s_!hDxB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15906032-647f-489d-ae66-e50f164e81b1_1024x461.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><em><strong>Former top 50 dropping hundreds of ranks down</strong></em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X8-_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e50730-3bd9-491f-8fa5-095b83cb706c_1024x225.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X8-_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e50730-3bd9-491f-8fa5-095b83cb706c_1024x225.png 424w, 
https://substackcdn.com/image/fetch/$s_!X8-_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e50730-3bd9-491f-8fa5-095b83cb706c_1024x225.png 848w, https://substackcdn.com/image/fetch/$s_!X8-_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e50730-3bd9-491f-8fa5-095b83cb706c_1024x225.png 1272w, https://substackcdn.com/image/fetch/$s_!X8-_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e50730-3bd9-491f-8fa5-095b83cb706c_1024x225.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X8-_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e50730-3bd9-491f-8fa5-095b83cb706c_1024x225.png" width="1024" height="225" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03e50730-3bd9-491f-8fa5-095b83cb706c_1024x225.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:225,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!X8-_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e50730-3bd9-491f-8fa5-095b83cb706c_1024x225.png 424w, 
https://substackcdn.com/image/fetch/$s_!X8-_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e50730-3bd9-491f-8fa5-095b83cb706c_1024x225.png 848w, https://substackcdn.com/image/fetch/$s_!X8-_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e50730-3bd9-491f-8fa5-095b83cb706c_1024x225.png 1272w, https://substackcdn.com/image/fetch/$s_!X8-_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03e50730-3bd9-491f-8fa5-095b83cb706c_1024x225.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>3. No Single Model (Compound AI Systems)</strong></p><p>With few exceptions, all well-performing submissions used an ensemble approach, in which multiple models are combined to produce a composite prediction. No single LLM placed in the top 1% of this competition, but LLMs were definitely used as part of the model ensembles (the DeBERTa v3 LLM turned out to be a very popular choice).</p><p>The methods for combining the outputs of individual models ranged from simple weighting to techniques such as stacking. Even the &#8220;good enough&#8221; model (see Learning #4) that guaranteed a top 5% ranking was an ensemble of 2 gradient boosted tree models.</p><p><em>Insight: Considering all the data preparation and transformation steps in between, all top submissions were effectively <a href="https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/">Compound AI Systems</a>. 
This is a new idea that is enabled by unified data and AI platforms, where one can put together multiple data processing steps and AI training or inference tasks into a graph of interacting components.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vmCe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F402b00b1-a4bd-4f1f-8a3e-37b8fae78a46_1024x324.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vmCe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F402b00b1-a4bd-4f1f-8a3e-37b8fae78a46_1024x324.png 424w, https://substackcdn.com/image/fetch/$s_!vmCe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F402b00b1-a4bd-4f1f-8a3e-37b8fae78a46_1024x324.png 848w, https://substackcdn.com/image/fetch/$s_!vmCe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F402b00b1-a4bd-4f1f-8a3e-37b8fae78a46_1024x324.png 1272w, https://substackcdn.com/image/fetch/$s_!vmCe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F402b00b1-a4bd-4f1f-8a3e-37b8fae78a46_1024x324.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vmCe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F402b00b1-a4bd-4f1f-8a3e-37b8fae78a46_1024x324.png" width="1024" height="324" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/402b00b1-a4bd-4f1f-8a3e-37b8fae78a46_1024x324.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:324,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!vmCe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F402b00b1-a4bd-4f1f-8a3e-37b8fae78a46_1024x324.png 424w, https://substackcdn.com/image/fetch/$s_!vmCe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F402b00b1-a4bd-4f1f-8a3e-37b8fae78a46_1024x324.png 848w, https://substackcdn.com/image/fetch/$s_!vmCe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F402b00b1-a4bd-4f1f-8a3e-37b8fae78a46_1024x324.png 1272w, https://substackcdn.com/image/fetch/$s_!vmCe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F402b00b1-a4bd-4f1f-8a3e-37b8fae78a46_1024x324.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Read up on an example of a Compound AI System in my previous <a href="https://datancoff.ee/2024/04/compound-ai-systems-building-a-github-bot-with-llama-3-and-dlthub/">blog post</a>.</p><p><strong>4. Gradient Boosted Decision Trees Were Good Enough</strong></p><p>The approach that got you into the top 5% of submissions (silver medal territory) was based on gradient boosted decision trees (GBDT), something the Data Science community was crazy about 5-10 years ago. About two weeks into the competition, <a href="https://www.kaggle.com/hideyukizushi">yukiZ</a>, a Kaggle Expert from Tokyo, Japan, shared a <a href="https://www.kaggle.com/code/hideyukizushi/aes2-deberta-lgbm-countvectorizer-lb-819?scriptVersionId=174653777">notebook</a> implementing a simple ensemble of two GBDT models &#8211; a LightGBM and an XGBoost model. 
That notebook, and derivative notebooks (especially the one from <a href="https://www.kaggle.com/batprem">Prem Chotepanit</a>), ended up scoring 0.83518+ &#8211; enough to place in the top 5%.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!215B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4459ada-4d89-44a1-834c-1443c40182c5_904x860.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!215B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4459ada-4d89-44a1-834c-1443c40182c5_904x860.png 424w, https://substackcdn.com/image/fetch/$s_!215B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4459ada-4d89-44a1-834c-1443c40182c5_904x860.png 848w, https://substackcdn.com/image/fetch/$s_!215B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4459ada-4d89-44a1-834c-1443c40182c5_904x860.png 1272w, https://substackcdn.com/image/fetch/$s_!215B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4459ada-4d89-44a1-834c-1443c40182c5_904x860.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!215B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4459ada-4d89-44a1-834c-1443c40182c5_904x860.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4459ada-4d89-44a1-834c-1443c40182c5_904x860.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!215B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4459ada-4d89-44a1-834c-1443c40182c5_904x860.png 424w, https://substackcdn.com/image/fetch/$s_!215B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4459ada-4d89-44a1-834c-1443c40182c5_904x860.png 848w, https://substackcdn.com/image/fetch/$s_!215B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4459ada-4d89-44a1-834c-1443c40182c5_904x860.png 1272w, https://substackcdn.com/image/fetch/$s_!215B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4459ada-4d89-44a1-834c-1443c40182c5_904x860.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The &#8220;Good Enough&#8221; approach calculated tens of thousands of features based on paragraphs, sentences, and words of the essays. It then selected 13,000 most relevant features and used them to make predictions. LLMs were used in this approach, but only to calculate 6 features representing the probability of being the right essay score (1 to 6). 
These features, by the way, ended up being some of the most predictive ones. Other predictive features were sentence- and paragraph-level statistics such as the number of errors in words, the number of short and long sentences and paragraphs, and the number of short and long words (representing syntax knowledge and writing conventions). The rest of the 13K features were occurrences of particular words (representing the breadth of the vocabulary of essay writers).</p><p>Since this approach was entirely independent of the Persuade/non-Persuade test dataset fiasco, it handled the switch to previously unseen test data rather well.&nbsp;</p><p><strong>5. Collaboration UX</strong></p><p>What made this competition so addictive and fun was 1) the people and 2) the Kaggle UX. <a href="https://www.kaggle.com/cdeotte/">Chris Deotte</a>, a data scientist at Nvidia, deserves a special call-out. He shared countless notebooks with new ideas and answered dozens of questions from folks who were just starting out. Unfortunately for Chris, the &#8220;shake-down&#8221; in the final scoring of the submissions (see Learning #2) affected him as much as other top participants, but he promised to continue sharing and helping in future competitions. 
Fingers crossed.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5f3C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5071a6aa-aaad-4667-8e0c-a63a6ac82b5a_904x904.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5f3C!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5071a6aa-aaad-4667-8e0c-a63a6ac82b5a_904x904.png 424w, https://substackcdn.com/image/fetch/$s_!5f3C!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5071a6aa-aaad-4667-8e0c-a63a6ac82b5a_904x904.png 848w, https://substackcdn.com/image/fetch/$s_!5f3C!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5071a6aa-aaad-4667-8e0c-a63a6ac82b5a_904x904.png 1272w, https://substackcdn.com/image/fetch/$s_!5f3C!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5071a6aa-aaad-4667-8e0c-a63a6ac82b5a_904x904.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5f3C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5071a6aa-aaad-4667-8e0c-a63a6ac82b5a_904x904.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5071a6aa-aaad-4667-8e0c-a63a6ac82b5a_904x904.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!5f3C!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5071a6aa-aaad-4667-8e0c-a63a6ac82b5a_904x904.png 424w, https://substackcdn.com/image/fetch/$s_!5f3C!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5071a6aa-aaad-4667-8e0c-a63a6ac82b5a_904x904.png 848w, https://substackcdn.com/image/fetch/$s_!5f3C!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5071a6aa-aaad-4667-8e0c-a63a6ac82b5a_904x904.png 1272w, https://substackcdn.com/image/fetch/$s_!5f3C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5071a6aa-aaad-4667-8e0c-a63a6ac82b5a_904x904.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Since I am working on compute options for Data Science teams in my day-to-day job at Databricks, I paid attention to the Kaggle UX. It was largely conducive to a fun learning and competitive environment, with some exceptions:</p><ul><li><p>Collaboration in Kaggle is done mostly by cloning Notebooks and sharing Datasets. 
Users start discussions and share their notebooks, which then get cloned and improved by others.&nbsp;</p></li></ul><ul><li><p>There is a lot of bad notebook code out there, and Kaggle does not help with refactoring it. Notebooks get cloned many, many times, code gets modified, and by clone #99 the notebook is full of code cells from various other notebooks, with training and inference code jammed into one notebook and toggled by variables. Ease of collaboration is important, but so is readability of code.</p></li><li><p>In modern data platforms, notebook sharing and collaboration are common features. As far as I know, no platform allows smart editing of the same notebook by multiple contributors that actually results in an easily readable artifact. Take note, Databricks, Snowflake, and all you Jupyter clones!</p></li><li><p>I saw almost no usage of the Models artifact. Instead, users create file snapshots of their models (sometimes as pickled objects) and make Datasets out of them. These model-datasets are then imported by other users as files and loaded into memory using the joblib or dill libraries. Presumably, if users were encouraged to share their models as Models (via API calls) instead of pickled objects, the code would be a bit easier to read.</p></li></ul><p><em>Insights: In future collaborative platforms I see teams submitting PRs against notebooks instead of cloning them. The PRs would be reviewed by the original notebook author, and the creator of the PR would be credited with points if the PR was approved. This would improve readability of code and prevent situations where random participants gain strong positions on the leaderboard just by cloning a good publicly shared notebook.&nbsp;</em></p><p><em>What&#8217;s next, you ask?&nbsp;</em></p><p>Have I mentioned that Kaggle is addictive? 
There is a new top competition on Kaggle &#8211; the <a href="https://www.kaggle.com/competitions/arc-prize-2024">ARC Prize</a> &#8211; and I intend to participate and get some AGI-&#8220;achieving&#8221; done. Expect a new report in a few months.&nbsp;</p>]]></content:encoded></item><item><title><![CDATA[Compound AI Systems: Building a GitHub bot with Llama 3 and dltHub]]></title><description><![CDATA[TL;DR: We built a bot that suggests a meaningful response to an ongoing conversation thread on GitHub.]]></description><link>https://www.datancoff.ee/p/compound-ai-systems-building-a-github-bot-with-llama-3-and-dlthub</link><guid isPermaLink="false">https://www.datancoff.ee/p/compound-ai-systems-building-a-github-bot-with-llama-3-and-dlthub</guid><dc:creator><![CDATA[Serhii Sokolenko]]></dc:creator><pubDate>Tue, 23 Apr 2024 12:25:20 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/93b57570-770d-413b-a8a8-faebe5e8caef_1024x325.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR:</strong> We built a bot that suggests a meaningful response to an ongoing conversation thread on GitHub. This bot can serve as a coding and design assistant or an initial evaluator of feature requests. In addition to generating a response using Meta&#8217;s recently released Llama 3 model, the bot also offers guardrails for its outputs by evaluating the response as being Safe or Not Safe For Work (NSFW). 
Our bot illustrates the idea of <a href="https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/">compound AI systems</a>, with two AI models being used for specialized tasks, an opinionated data conversion step, and data readers and writers based on <a href="https://dlthub.com/">dltHub</a>.&nbsp;&nbsp;</p><h2>Sample GitHub feature request issue and bot responses</h2><p>To demonstrate the workings of the bot, let&#8217;s look at an actual feature request submitted on dltHub&#8217;s dlt repo on GitHub (issue <a href="https://github.com/dlt-hub/dlt/issues/933">933</a>). dltHub is an open-source project hosted on GitHub that frequently gets contributions from its community.</p><pre><code>Feature description
dlt should support Postgres database replication. This should be efficiently done, using a CDC-based approach leveraging the transaction log (WAL)&#8212;not by querying the actual data directly.
Implementation should be as generic as possible, so database replication support for other databases (e.g. SQL Server) can easily be added later.
[EDITED FOR BREVITY]</code></pre><p>Here is the response our bot generated before taking into account the replies from the repo maintainers and the FR submitter:</p><pre><code>Implementing PostgreSQL database replication support in dlt using a Change Data Capture (CDC) approach, leveraging the transaction log (WAL), is a great idea! This will not only improve efficiency but also enable future support for other databases like SQL Server.

To achieve this, we can follow these high-level steps:

1. **Establish a connection to the PostgreSQL database**: Use a library like `pg` in Python to connect to the PostgreSQL database and retrieve the transaction log (WAL) files.
2. **Parse the transaction log (WAL) files**: Use a library like `pg_wal` in Python to parse the WAL files and extract the necessary information, such as transaction IDs, timestamps, and affected tables.

[EDITED FOR BREVITY]</code></pre><p>And here is the bot response after taking into account all 13 comments that were added to this issue, with the last comment being from the original submitter of the FR:</p><p>Last Comment (<a href="https://github.com/dlt-hub/dlt/issues/933#issuecomment-1963039854">Link</a>):</p><pre><code>@codingcyclist
re. selecting columns: I had a little chat with @rudolfix about this on Slack. Schema contracts are still experimental and likely to change significantly. Hence, not a good idea to rely on this for now. Also, it's more efficient if we can do the filtering further upstream. My idea is to include an argument on the replication resource that let's you provide a column list, which is then used to filter already in the resource.</code></pre><p>Bot response:</p><pre><code>That makes sense. I didn't know that schema contracts were still experimental. And you're right, it's better to do the filtering upstream rather than relying on the `discard_row` and `discard_value` options.

So, if I understand correctly, you're proposing to add an argument to the replication resource that allows users to specify a list of columns to include in the replication. This would then be used to filter the data in the resource, rather than relying on schema contracts. Is that correct?</code></pre><p>Bot&#8217;s evaluation of the response being &#8220;Safe For Work&#8221;:&nbsp;</p><pre><code>{'label': 'SFW', 'score': 0.7747769355773926}</code></pre><h2>Implementation Chain&nbsp;</h2><p>We broke down our implementation into 5 components that match the steps in our &#8220;compound AI system&#8221; chain:&nbsp;</p><ul><li><p>Read from GitHub using <code>ReadGithubIssue</code>,</p></li><li><p>Convert the GitHub thread into a chat template that Llama 3 can understand using <code>GithubIssueToChat</code>,</p></li><li><p>Generate a proposed response to the chat messages with the newest Llama 3 LLM using <code>InferWithLlama3Instruct</code>,</p></li><li><p>Classify the response as Safe or Not Safe for Work using a text classification model from Hugging Face with <code>InferWithHuggingface</code>, and, lastly,</p></li><li><p>Write our output to files using <code>WriteFile</code>.</p></li></ul><p>In the future, we want, of course, to replace the writes to files with a write to GitHub itself, by creating a new comment.</p><p>We start by importing the little wrapper components that we wrote: the data readers and writers, the inference classes, and the data converters.</p><pre><code>from core.readers.github import ReadGithubIssue
from core.transforms.github import GithubIssueToChat
from core.inference.llama3instruct import InferWithLlama3Instruct
from core.inference.huggingface import InferWithHuggingface
from core.writers.dlt import WriteFile
import os


# Get the Huggingface token
HF_TOKEN = os.getenv("HF_TOKEN")</code></pre><p>All our data readers and writers are based on <a href="https://dlthub.com/">dltHub</a> (on <a href="https://github.com/dlt-hub/dlt">GitHub</a>), an awesome Python library that runs entirely in-process, requires no dedicated compute infrastructure, and solves a bunch of real problems like pipeline restarts, schema detection, mapping and evolution. &#8220;Dlt&#8221;, by the way, stands for &#8220;Data Load Tool&#8221;, not the Delta Live Tables offered by Databricks.</p><pre><code># Create a GitHub reader action that will read from GitHub and store the results in memory (the readitems attribute of the class)
# This reader will read the main issue and all comments for a specific issue number
read_issue = ReadGithubIssue()
read_issue(repo_owner="dlt-hub", repo="dlt", issue_number=933)
issues = read_issue.readitems['issues']
comments = read_issue.readitems['comments']</code></pre><p>After we get the issue record and all its comments, we have to convert them into a format that Llama 3 will understand. This is the standard <a href="https://platform.openai.com/docs/guides/text-generation/chat-completions-api">chat messages format</a> popularized by OpenAI: each message is its own Python dictionary with <code>role</code> and <code>content</code> fields.</p><pre><code># Start building inputs into an LLM. Issues and comments get converted into a chat-like thread
# Each message in the chat is tagged with one of 3 roles: system, assistant, user
# The first message is by 'system' and instructs our LLM how to behave
messages = []
messages.append({"role": "system", "content": "You are a coding assistant that answers user questions posted to GitHub!"})
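
# Illustration only (sample values are made up for this sketch, not read from issue 933):
# every entry appended to the list is assumed to be a dictionary with exactly two keys,
# "role" (one of system, assistant, user) and "content" (the message text).
sample_thread = [
    {"role": "user", "content": "dlt should support Postgres database replication."},
    {"role": "assistant", "content": "A CDC-based approach sounds reasonable."},
]
allowed_roles = {"system", "assistant", "user"}
assert all(set(entry) == {"role", "content"} for entry in sample_thread)
assert all(entry["role"] in allowed_roles for entry in sample_thread)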


# Now convert all comments from Github that we read earlier
convert_gh_issue = GithubIssueToChat()
chat = convert_gh_issue(issues=issues, comments=comments)
messages.extend(chat)</code></pre><p>We then run our inference using Meta&#8217;s Llama 3 model on a MacBook Pro that has no problem fitting the 8 billion parameter version of the model (it needs about 20 GB of RAM). We use the Apple Silicon GPUs (&#8220;mps&#8221; device). Loading the model (first line of code) takes about 10 seconds; the time to run inference on chat messages and generate a response depends on the total size of these messages. It takes somewhere between a few seconds and a few dozen seconds (when the content length is about 4K words).&nbsp;</p><pre><code># Pass this chat message list to the Llama3 model and get a response
infer_with_llama3 = InferWithLlama3Instruct(HF_TOKEN,"mps")
response = infer_with_llama3(messages)</code></pre><p>As the second inference call in our chain, we run the response through a text classification model, <a href="https://huggingface.co/michellejieli/NSFW_text_classifier?not-for-all-audiences=true">&#8220;michellejieli/NSFW_text_classifier&#8221;</a>, that checks if the response is Safe or Not Safe for Work.</p><pre><code># Check that our response is safe to use at work.
# Generate an NSFW score using a model on Huggingface
filter_nsfw = InferWithHuggingface(task="text-classification", model="michellejieli/NSFW_text_classifier", device="mps")
nsfw_score = filter_nsfw(response)[0]</code></pre><p>We merge the response and the NSFW classification scores and prepare two output files &#8211; one for the response message itself, and another one for the full contents of the chat, including the issue text, all comments, and the proposed bot response. The data will be written in JSON Lines (<code>jsonl</code>) format.</p><pre><code># Start preparing outputs.
response_message = {"role": "assistant", "content": response}
response_message = {**response_message, **nsfw_score}
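# Illustration only (made-up values shaped like the classifier output shown earlier):
# unpacking two dictionaries with ** yields one flat record that carries the message
# fields and the label/score fields side by side.
sample_merged = {**{"role": "assistant", "content": "That makes sense."}, **{"label": "SFW", "score": 0.77}}
assert sorted(sample_merged) == ["content", "label", "role", "score"]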
messages.append(response_message)


# Write the last response and the full chat to local files
write_last_response = WriteFile(
   "github_bot", bucket_url="file://gh_bot_last_response")
write_last_response([response_message], loader_file_format="jsonl")


write_full_chat = WriteFile(
   "github_bot", bucket_url="file://gh_bot_full_chat")
write_full_chat(messages, loader_file_format="jsonl")</code></pre><p>The source code for this implementation is available in this <a href="https://gist.github.com/datancoffee/d364ee45582db192768a9a3fade4725d">gist</a>.</p><p>Now that we know what the bot does and what its design is, let&#8217;s dive deeper into the concept of compound AI systems on which it was built.</p><h2>Background</h2><p>In a <a href="https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/">paper</a> published in February 2024, Matei Zaharia (CTO of Databricks) and a bunch of smart folks at Berkeley AI Research (BAIR) and Databricks described a new trend that they were seeing in the market.&nbsp;</p><p>Instead of a single, monolithic AI model being used to achieve state-of-the-art results (think a chat interface to ChatGPT where you type in something, and you immediately get a result from the model), they saw longer chains of components working together to produce the best results. If you are familiar with data processing pipelines, imagine multiple processing steps being chained together, where the outputs of one step become the inputs of the next. What are these components, you might ask? 
Think of search engines, data format converters, data readers and writers, and, quite often, different models being used one after another.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2Vf_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47d506e-2d2b-4b25-baf0-c5393b979bdc_1024x325.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2Vf_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47d506e-2d2b-4b25-baf0-c5393b979bdc_1024x325.png 424w, https://substackcdn.com/image/fetch/$s_!2Vf_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47d506e-2d2b-4b25-baf0-c5393b979bdc_1024x325.png 848w, https://substackcdn.com/image/fetch/$s_!2Vf_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47d506e-2d2b-4b25-baf0-c5393b979bdc_1024x325.png 1272w, https://substackcdn.com/image/fetch/$s_!2Vf_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47d506e-2d2b-4b25-baf0-c5393b979bdc_1024x325.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2Vf_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47d506e-2d2b-4b25-baf0-c5393b979bdc_1024x325.png" width="1024" height="325" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f47d506e-2d2b-4b25-baf0-c5393b979bdc_1024x325.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:325,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!2Vf_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47d506e-2d2b-4b25-baf0-c5393b979bdc_1024x325.png 424w, https://substackcdn.com/image/fetch/$s_!2Vf_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47d506e-2d2b-4b25-baf0-c5393b979bdc_1024x325.png 848w, https://substackcdn.com/image/fetch/$s_!2Vf_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47d506e-2d2b-4b25-baf0-c5393b979bdc_1024x325.png 1272w, https://substackcdn.com/image/fetch/$s_!2Vf_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff47d506e-2d2b-4b25-baf0-c5393b979bdc_1024x325.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Source: BAIR paper</p><h2>Examples of Compound AI Systems</h2><p><strong>Better Inputs</strong></p><p>RAG (retrieval-augmented generation) systems are the best-known (and relatively simple) examples of compound AI systems. In a RAG system, there is a retriever and a generator. Questions from users don&#8217;t go directly as input into a model (the generator in this system). Instead, there is a search step (the retriever), where a database is searched for content relevant to the user&#8217;s input (usually, in a vector database). Documents that are relevant to the user inputs are brought together into what is known as &#8220;context&#8221; and added to the user question. A simple user question &#8220;What is my current sales in EMEA?&#8221; becomes a much longer piece of text, with recent Google Docs, PDF files and presentations, all from the corporate file storage system, added to the question. 
This is important, because the generic model from an AI vendor does not have the latest information and knows very little about your own business. By adding context, one can make a generic AI model answer questions about something very recent and non-public.&nbsp;</p><p>The length of these documents can be quite significant, measured in thousands of words (for comparison, one page of English text is about 500 words). That&#8217;s why you see many Generative AI model builders (Meta, OpenAI, Anthropic, you name it) bragging about the &#8220;context length&#8221; of their models. The recently released <a href="https://huggingface.co/databricks/dbrx-instruct">Databricks DBRX</a> model boasts a maximum context length of 32K tokens. At roughly 0.75 English words per token (a common rule of thumb), that&#8217;s about 50 pages of internal, non-public, up-to-date information that you can add to your question about EMEA sales.</p><p><strong>Safer Outputs</strong></p><p>Another example of a component used in Compound AI Systems is the so-called Guardrail, in other words, a tool that controls the outputs of an LLM (several are <a href="https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/#emerging-paradigms">mentioned</a> in the February BAIR paper). 
Whereas in RAG systems we improve the results by feeding better inputs into the model, the Guardrail components make results better by preventing potentially incorrect or embarrassing outputs.&nbsp;</p><p><em><strong>We use one such guardrail in our own GitHub bot implementation &#8211; it checks if the response is Safe or Not Safe for Work using a text classification model.&nbsp;</strong></em></p><h2>Frameworks for Building Compound AI Systems</h2><p>Two recent projects have received significant attention for proposing a new syntax for expressing chains of components and for offering component libraries that implement common actions or tasks: LangChain and, to a lesser degree, Metaflow.</p><p><a href="https://www.langchain.com/">LangChain</a> cleverly uses Python&#8217;s ability to overload the | (pipe) operator to bring back the syntax that folks who are familiar with Unix pipes, Apache Beam, and <a href="https://github.com/twitter-archive/pycascading">PyCascading</a> (by Twitter) find very appealing. Since we count ourselves among these folks, we can&#8217;t wait until the next long weekend to hash out the <code>__or__</code> implementation in Python that would allow us to define our GitHub bot as:</p><pre><code>  ReadGithubIssue()( &#8230; params &#8230; ) 
| GithubIssueToChat()( &#8230; params &#8230; )
| InferWithLlama3Instruct()( &#8230; params &#8230; )
| InferWithHuggingface()( &#8230; params &#8230; )
| WriteFile()( &#8230; params &#8230; )</code></pre><p>If the above did not quite make sense, review <a href="https://stackoverflow.com/questions/25211477/what-is-the-purpose-of-pythons-built-in-bool-method-ror">this SO discussion</a> on Python operator overloading.</p><p>The idea behind this syntax is that one writes the steps of the pipeline one after another, and that the outputs of the previous step become the inputs of the following step.</p><p>Fun fact: the author of this blog onboarded the data pipelines teams from Twitter as they were migrating from the <a href="https://www.cascading.org/">Cascading</a> framework to Apache Beam running on Cloud Dataflow. Luckily for these folks, they could continue using the Python pipe operator in Beam to write marvels like the one above.</p><p><a href="https://metaflow.org/">Metaflow</a> (open-sourced by Netflix) offers an alternative view on defining chains of actions. By using the <code>@step</code> decorator (and several others) on the methods of a flow class in Python, one can define a complete pipeline that combines data processing and ML inference steps.&nbsp;</p><p>With an important part of the industry coalescing around the idea of chains of AI and data processing components, it becomes clear that we don&#8217;t have to bet the house to improve the quality of our AI systems. Instead of training an ever-larger ChatGPT model, we can combine several smaller ones into something equally good.&nbsp;</p><p>We decided to implement a useful use case that would demonstrate the idea of compound AI systems.</p><h2>Meta Llama 3</h2><p>Something else happened very recently that jolted us to action. Meta has finally released their Llama 3 model, which the whole industry had been waiting for. 
The model comes in two variants &#8211; the smaller 8B parameter model and the larger 70B model &#8211; and is available for online inference at <a href="https://www.databricks.com/blog/building-enterprise-genai-apps-meta-llama-3-databricks">Databricks Model Serving</a>. Interestingly, the 8B parameter model requires less than 20GB of memory for inference, which is totally doable locally on a MacBook Pro.</p><p>Within 24 hours of Llama 3 hitting Hugging Face and being announced by Meta&#8217;s partners (see the Databricks <a href="https://www.databricks.com/blog/building-enterprise-genai-apps-meta-llama-3-databricks">blog</a>), rumors started circulating that it was quite good.</p><p>We knew what we had to do. We would use a small but state-of-the-art LLM and combine it with other models and business logic.</p><h2>Looking for a realistic use case</h2><p>To make the evaluation more realistic, we thought of some real-life tasks that businesses would want to solve with AI. We deal a lot with application developers, and they live and breathe GitHub and all the cool projects that are out there. A very common workflow on GitHub is for developers to open Issues that propose new features. These feature proposals sometimes result in long comment threads, where project maintainers evaluate the proposals and the community chimes in. A bot that automatically evaluates a Feature Request, or proposes a response to a user&#8217;s question, might be quite useful to repo owners and users of GitHub. As an added bonus, the GitHub API is open and free when used in reasonable quantities.</p><p>We had our use case. We would build a bot for GitHub repo owners.</p><h2>Getting data in and out</h2><p>The best AI system is nothing without good data. While most data frameworks offer common connectors to things like databases and file systems, application-specific connectors are harder to find. 
This is where <a href="https://dlthub.com/">dltHub</a> comes in. Their Data Load Tool (<code>dlt</code>) looks deceptively simple, but it packs a big punch once you dive deeper into it. One picks a source and a destination, and the tool generates an initial pipeline that can be customized later. The following commands install <code>dlt</code> and then generate the source code and config files needed to read from GitHub and write to files.</p><pre><code>pip install dlt
dlt init github filesystem</code></pre><p>What differentiates <code>dlt</code> from all the other ETL tools out there is that it&#8217;s both a code generation tool and a Python library. The library can run entirely inside the process of your main application &#8211; you don&#8217;t need separate compute infrastructure to run your data pipelines. It also takes care of one of the hardest things in software engineering (besides naming and pricing, of course): schema detection and evolution. The tool detects schemas at the source (including nested fields) and is able to replicate them at the destination, without you having to specify the mapping rules.</p><p><code>dlt</code> comes with dozens of pre-defined sources and destinations, but it&#8217;s quite easy to define new ones as well. In fact, the project positions itself as an Extract-Load-Transform (ELT) tool for the long tail of data sources and destinations. Because it was originally designed to be used in the ELT pattern, it does the E and the L, and delegates the T (transformations) to tools like <code>dbt</code> or <code>Airflow</code> or to the processing capabilities of the destination (e.g. if the destination is Databricks, you could use Databricks SQL to transform your data once you have loaded it into the bronze layer of a Delta Lake).</p><p>However, for our bot implementation we needed to load data into memory in order to pass it to our model inference components. The results of inference would also be in memory, and they would eventually need to be written to some external destination. We decided to use the dlt framework and build a new dlt destination (and a new source) that would reside entirely in memory.</p><p>We created a <code>Read</code> class that can take any valid <code>dlt</code> source as input and read its contents into two of its attributes: <code>readitems</code> and <code>readtableschema</code>.</p><pre><code>import dlt
from dlt.common.typing import TDataItems
from dlt.common.schema import TTableSchema
from dlt.common.destination import Destination
from typing import Any
from core.actions import Action


class Read(Action):


   def __init__(self, actionname: str = None):
       super().__init__(actionname)
       # needs to say from_reference("destination"... to work
       self.dltdestination = Destination.from_reference(
           "destination",
           destination_name=self.actionname+"_destination",
           destination_callable=self.read_destination)
       self.dltpipeline = dlt.pipeline(
           self.actionname+"_destination_pipeline",
           destination = self.dltdestination)


       self.clean_state()


   def read_destination(self, items: TDataItems, table: TTableSchema) -&gt; None:
       tablename = table["name"]
       if tablename not in self.readitems:
           self.readitems[tablename] = []
       self.readitems[tablename].extend(items)
       self.readtableschema[tablename] = table


   def clean_state(self):
       self.readitems={}
       self.readtableschema={}


   def do(self, *args:Any, **kwargs: Any):
       self.clean_state()
       self.dltpipeline.run(*args, **kwargs)</code></pre><p>We then defined the <code>ReadGithub(Read)</code> and <code>ReadGithubIssue(ReadGithub)</code> classes, an abstraction that lets us be very succinct when specifying steps in our compound system chain.</p><p>Here is the definition of the <code>ReadGithubIssue(ReadGithub)</code> class, in case you were wondering.</p><pre><code>class ReadGithubIssue(ReadGithub):


   def do(self, repo_owner, repo, issue_number, *args:Any, **kwargs: Any):
       base_github_url = f"https://api.github.com/repos/{repo_owner}/{repo}"
       issues_endpoint = self.build_entityspec(
            entity="issues", issue_number=issue_number)
       comments_endpoint = self.build_entityspec(
            entity="comments", issue_number=issue_number)
       endpoints = [issues_endpoint,comments_endpoint]
       super().do(base_github_url, endpoints, *args, **kwargs)</code></pre><p>And here is all the code that one needs to write to pull Issue 933 from the repo &#8220;dlt&#8221; of repo owner &#8220;dlt-hub&#8221;.&nbsp;</p><pre><code># Create a Github reader action that will read from Github and store in memory (readitems attributes of the class)
# This reader will read the main issue and all comments for a specific issue number
read_issue = ReadGithubIssue()
read_issue(repo_owner="dlt-hub", repo="dlt", issue_number=933)
issues = read_issue.readitems['issues']
comments = read_issue.readitems['comments']</code></pre><p>Similarly, we also wrote a <code>WriteFile(Action)</code> class that takes an in-memory object (a list) and writes it to either local or cloud-based files.</p><p>The source code for this class can be reviewed in this <a href="https://gist.github.com/datancoffee/18accb81a54689fe69d2732203c2bef5">gist</a>.</p><p>This abstraction allows us to save the contents of an in-memory object to a file in two lines of code.</p><pre><code>write_full_chat = WriteFile("github_bot", bucket_url="file://gh_bot_full_chat")
write_full_chat(messages,loader_file_format="jsonl")</code></pre><p>While this, by itself, is no big news, the power of dlt gives us various options:</p><ul><li><p>Format: jsonl, parquet, csv and INSERT VALUES (SQL script format)</p></li><li><p>Write disposition: append, replace, merge</p></li><li><p>Destination: local disk, S3, Google Cloud Storage or Azure Blob Storage</p></li></ul><h2>What about the mysterious Action class?</h2><p>Throughout the code samples in this report we used a base class <code>Action</code> from which many other classes were derived. This is a super simple piece of code, and it does just two things:</p><ul><li><p>It defines a constructor <code>__init__</code> where the objects are initialized.</p></li><li><p>It defines a <code>do</code> function that &#8220;does all the doing&#8221; of the Action class.</p></li></ul><pre><code>from typing import Any


class Action:
   def __init__(self, actionname: str = None):
       if actionname is not None:
           self.actionname = actionname
       else:
           self.actionname = "UnnamedAction"


   def do(self,*args:Any, **kwargs: Any):
       pass


   # ThisAction(params) or ThisAction.do(params)
   def __call__(self,*args:Any, **kwargs: Any):
       return self.do(*args, **kwargs)</code></pre><p>The additional <code>__call__</code> function is just syntactic sugar that allows us to write</p><pre><code>write_full_chat(messages,loader_file_format="jsonl")</code></pre><p>instead of</p><pre><code>write_full_chat.do(messages,loader_file_format="jsonl")</code></pre><p>The purpose of this Action class hierarchy is to enable chaining of Action objects down the road. If every component has the same super-simple interface (the &#8220;do&#8221; function), it will be easier for us to write concise and readable code. In this regard we are following in the footsteps of the greats (LangChain, Beam, PyCascading).</p><p>LangChain defines an interface &#8211; <a href="https://js.langchain.com/docs/expression_language/interface">Runnable</a> &#8211; from which all other components are derived. LangChain requires all Runnable-derived components to specify three functions that allow chaining:</p><ul><li><p><code>stream</code>: stream back chunks of the response</p></li><li><p><code>invoke</code>: call the chain on an input</p></li><li><p><code>batch</code>: call the chain on a list of inputs</p></li></ul><p>We believe our &#8220;do&#8221; function can represent both &#8220;invoke&#8221; and &#8220;batch&#8221; (if we treat each input as a list) and quite possibly the &#8220;stream&#8221; function as well.</p><p>Our goal is to be able to write our bot in five lines of code!</p><pre><code>  ReadGithubIssue()( &#8230; params &#8230; ) 
| GithubIssueToChat()( &#8230; params &#8230; )
| InferWithLlama3Instruct()( &#8230; params &#8230; )
| InferWithHuggingface()( &#8230; params &#8230; )
| WriteFile()( &#8230; params &#8230; )</code></pre><h2>Summary and Future Plans</h2><p>It was fun playing with the latest Llama 3 model and evaluating state-of-the-art concepts (like compound AI systems) and tools (like LangChain and dltHub). Having built two startups in the Natural Language Processing space in the 2010s, we are impressed by the progress in language understanding, although most of us probably still can&#8217;t comprehend how these generative models actually produce their output.</p><p>The big headline of this report is that Llama 3 is indeed awesome. Its 8B parameter version is a small but powerful LLM that can be used for implementing realistic use cases. However, the text it generates can sometimes appear to just reword the contents of the whole chat, or state facts that don&#8217;t add value. Improving the quality of its outputs through techniques such as retrieval augmentation or fine-tuning should make it better for this use case.</p><p>The other big headline is that chained or compound systems are the more natural choice for engineers trying to improve their applications with AI. If we were to make a suggestion to the authors of the BAIR paper, it would be to use the term &#8220;compound systems using AI&#8221; instead of &#8220;compound AI systems&#8221;. And if we could make a suggestion to our friends at LangChain, it would be to rename the project to AppChain. This would reflect what these chains really are &#8211; full-blown applications that just happen to be using a lot of AI (but also a lot of data processing and business logic). But naming is hard, and we won&#8217;t be sad if our suggestions are not acted on.</p><p>There are also a bunch of things that we have an appetite for doing in the near term:</p><ul><li><p>First, we did not have time to implement a Retriever for our LLM generation step. This retriever could be populated with user documentation from each individual repo. 
When we build the retriever, we will be ready to move from local execution on our MacBooks to an actual development platform like Databricks and use the <a href="https://www.databricks.com/product/machine-learning/vector-search">Databricks Vector Search</a> feature.</p></li><li><p>Second, we are itching to fine-tune the Llama 3 model with an Issues/Comments training data set built from previous conversations in dltHub&#8217;s repo. This would give the LLM the &#8220;voice&#8221; of the repo maintainers.</p></li><li><p>Third, while the 8B Llama 3 model was great, why not use its larger cousin, the 70B parameter one? The larger model won&#8217;t fit into a laptop&#8217;s memory anymore, so we will use <a href="https://www.databricks.com/product/model-serving">Databricks Model Serving</a> for that.</p></li><li><p>Fourth, we want to make our bot an actual bot that writes into GitHub directly.</p></li><li><p>Fifth, we want to improve our Action framework to allow better chaining.</p></li></ul><p>It&#8217;s going to be fun to make it all work!</p><h2>Code</h2><p>Main gist for <a href="https://gist.github.com/datancoffee/d364ee45582db192768a9a3fade4725d">GitHub bot</a></p><p><a href="https://gist.github.com/datancoffee/eb60390fcd4deda25feb42c4baa336ff">Action</a> class</p><p>Generic <a href="https://gist.github.com/datancoffee/684927c06630af7ac62b29ef47da694f">Read</a> class</p><p><a href="https://gist.github.com/datancoffee/18accb81a54689fe69d2732203c2bef5">WriteFile</a> class</p><p><a href="https://gist.github.com/datancoffee/b0c46d9d04c5584ce2a1847f515de2eb">GithubIssueToChat</a> class</p><p><a href="https://gist.github.com/datancoffee/c8e1263c00e34465e65294f0f3923674">ReadGithubIssue</a> class</p><p><a href="https://gist.github.com/datancoffee/dd2257c2d18a45ea5eb9ceeaf4cde7f1">Llama3 wrapper</a> class</p><p>Generic class for inference on <a href="https://gist.github.com/datancoffee/4575e8e8900264546051a7b0b53eb8fa">Hugging Face 
models</a></p>]]></content:encoded></item><item><title><![CDATA[A DuckDB moment for application servers?]]></title><description><![CDATA[Apple announced their new high-end Mac Pro desktop with 24 CPU cores, up to 76 GPU cores, 192GB of memory and 800GB/s of system memory bandwidth (for comparison: the top-end InfiniBand, if I am not mistaken, is 400Gb/s, or about 50GB/s!).]]></description><link>https://www.datancoff.ee/p/a-duckdb-moment-for-application-servers</link><guid isPermaLink="false">https://www.datancoff.ee/p/a-duckdb-moment-for-application-servers</guid><dc:creator><![CDATA[Serhii Sokolenko]]></dc:creator><pubDate>Tue, 06 Jun 2023 18:11:36 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/aa58e17d-0719-4690-bca7-771cc2f6002b_336x746.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Apple announced their <a href="https://www.apple.com/newsroom/2023/06/apple-introduces-m2-ultra/">new high-end Mac Pro</a> desktop with 24 CPU cores, up to 76 GPU cores, 192GB of memory and 800GB/s of system memory bandwidth (for comparison: the top-end InfiniBand, if I am not mistaken, is 400Gb/s, or about 50GB/s!). The price, I hear, is $10k US.</p><p>The specs are slightly ridiculous (100 cores and memory larger than some of the biggest boxes on AWS). This raises the question &#8211; are we ready for a DuckDB-like movement to application servers running under our desks? 
Just like it was in the 2000s with the Sun Sparcs and the Unixwares and the baby-AS400s.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Jzd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526d55e8-fea4-4ff8-88ca-e97e4f0ed73e_336x746.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Jzd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526d55e8-fea4-4ff8-88ca-e97e4f0ed73e_336x746.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6Jzd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526d55e8-fea4-4ff8-88ca-e97e4f0ed73e_336x746.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6Jzd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526d55e8-fea4-4ff8-88ca-e97e4f0ed73e_336x746.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6Jzd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526d55e8-fea4-4ff8-88ca-e97e4f0ed73e_336x746.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Jzd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526d55e8-fea4-4ff8-88ca-e97e4f0ed73e_336x746.jpeg" width="336" height="746" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526d55e8-fea4-4ff8-88ca-e97e4f0ed73e_336x746.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:746,&quot;width&quot;:336,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!6Jzd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526d55e8-fea4-4ff8-88ca-e97e4f0ed73e_336x746.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6Jzd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526d55e8-fea4-4ff8-88ca-e97e4f0ed73e_336x746.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6Jzd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526d55e8-fea4-4ff8-88ca-e97e4f0ed73e_336x746.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6Jzd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526d55e8-fea4-4ff8-88ca-e97e4f0ed73e_336x746.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>But nowadays, of course, with more modern app servers &#8211; Flink, Spark, even the good old Apache Beam.</p><p>Obviously, you would not run a production process that generates your payroll on a box near where your most dedicated employees take a nap or have a snack.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3bXS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc347e4c3-67e7-4d7e-83ef-7b5c339ea372_675x900.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3bXS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc347e4c3-67e7-4d7e-83ef-7b5c339ea372_675x900.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!3bXS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc347e4c3-67e7-4d7e-83ef-7b5c339ea372_675x900.jpeg 848w, https://substackcdn.com/image/fetch/$s_!3bXS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc347e4c3-67e7-4d7e-83ef-7b5c339ea372_675x900.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!3bXS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc347e4c3-67e7-4d7e-83ef-7b5c339ea372_675x900.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3bXS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc347e4c3-67e7-4d7e-83ef-7b5c339ea372_675x900.jpeg" width="675" height="900" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c347e4c3-67e7-4d7e-83ef-7b5c339ea372_675x900.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:900,&quot;width&quot;:675,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!3bXS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc347e4c3-67e7-4d7e-83ef-7b5c339ea372_675x900.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!3bXS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc347e4c3-67e7-4d7e-83ef-7b5c339ea372_675x900.jpeg 848w, https://substackcdn.com/image/fetch/$s_!3bXS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc347e4c3-67e7-4d7e-83ef-7b5c339ea372_675x900.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!3bXS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc347e4c3-67e7-4d7e-83ef-7b5c339ea372_675x900.jpeg 1456w" sizes="100vw"></picture></div></a></figure></div><p>But what if these high-performance desktops running new, modern app servers were organized in a mesh, exchanged jobs using a standard API, and were put to a better use than generating bitcoin? Mind blown.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DtUC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5a4e2b8-a3cb-4c4d-8f49-dbc24cf384f1_498x330.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DtUC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5a4e2b8-a3cb-4c4d-8f49-dbc24cf384f1_498x330.png 424w, https://substackcdn.com/image/fetch/$s_!DtUC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5a4e2b8-a3cb-4c4d-8f49-dbc24cf384f1_498x330.png 848w, https://substackcdn.com/image/fetch/$s_!DtUC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5a4e2b8-a3cb-4c4d-8f49-dbc24cf384f1_498x330.png 1272w, https://substackcdn.com/image/fetch/$s_!DtUC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5a4e2b8-a3cb-4c4d-8f49-dbc24cf384f1_498x330.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DtUC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5a4e2b8-a3cb-4c4d-8f49-dbc24cf384f1_498x330.png" width="498" height="330" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5a4e2b8-a3cb-4c4d-8f49-dbc24cf384f1_498x330.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:330,&quot;width&quot;:498,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!DtUC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5a4e2b8-a3cb-4c4d-8f49-dbc24cf384f1_498x330.png 424w, https://substackcdn.com/image/fetch/$s_!DtUC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5a4e2b8-a3cb-4c4d-8f49-dbc24cf384f1_498x330.png 848w, https://substackcdn.com/image/fetch/$s_!DtUC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5a4e2b8-a3cb-4c4d-8f49-dbc24cf384f1_498x330.png 1272w, https://substackcdn.com/image/fetch/$s_!DtUC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5a4e2b8-a3cb-4c4d-8f49-dbc24cf384f1_498x330.png 1456w" sizes="100vw"></picture></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[Time Value of Data: The Summit of Now and the Peak of Soon After]]></title><description><![CDATA[Last week I had the chance to visit a major global fashion retailer and give an industry talk on Real-time AI. 
This company was hosting Tech Days and invited a few of their vendors, including Snowflake, to give a talk to thousands of their tech employees joining in-person and online.]]></description><link>https://www.datancoff.ee/p/time-value-of-data-the-summit-of-now-and-the-peak-of-soon-after</link><guid isPermaLink="false">https://www.datancoff.ee/p/time-value-of-data-the-summit-of-now-and-the-peak-of-soon-after</guid><dc:creator><![CDATA[Serhii Sokolenko]]></dc:creator><pubDate>Sun, 02 Apr 2023 19:37:53 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/17c99592-b430-4181-aab2-f27203f8f786_1024x461.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last week I had the chance to visit a major global fashion retailer and give an industry talk on <strong>Real-time AI</strong>. This company was hosting Tech Days and invited a few of their vendors, including Snowflake, to give a talk to thousands of their tech employees joining in-person and online.&nbsp;</p><p>To get to their HQ I had to travel to a small town on the Galician coast of Spain. If you know, you know. The closer I got to the HQ, the more well-dressed folks I saw.</p><p>The view from my hotel room was pretty impressive. 
They say one must try an octopus when visiting.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eJPI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5e7c06-fcc6-4ba6-bd94-233a0a188f30_1024x461.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eJPI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5e7c06-fcc6-4ba6-bd94-233a0a188f30_1024x461.png 424w, https://substackcdn.com/image/fetch/$s_!eJPI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5e7c06-fcc6-4ba6-bd94-233a0a188f30_1024x461.png 848w, https://substackcdn.com/image/fetch/$s_!eJPI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5e7c06-fcc6-4ba6-bd94-233a0a188f30_1024x461.png 1272w, https://substackcdn.com/image/fetch/$s_!eJPI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5e7c06-fcc6-4ba6-bd94-233a0a188f30_1024x461.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eJPI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5e7c06-fcc6-4ba6-bd94-233a0a188f30_1024x461.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe5e7c06-fcc6-4ba6-bd94-233a0a188f30_1024x461.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Real-time 
AI.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Real-time AI." title="Real-time AI." srcset="https://substackcdn.com/image/fetch/$s_!eJPI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5e7c06-fcc6-4ba6-bd94-233a0a188f30_1024x461.png 424w, https://substackcdn.com/image/fetch/$s_!eJPI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5e7c06-fcc6-4ba6-bd94-233a0a188f30_1024x461.png 848w, https://substackcdn.com/image/fetch/$s_!eJPI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5e7c06-fcc6-4ba6-bd94-233a0a188f30_1024x461.png 1272w, https://substackcdn.com/image/fetch/$s_!eJPI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe5e7c06-fcc6-4ba6-bd94-233a0a188f30_1024x461.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KK-n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47015be5-8c48-48b0-885a-16a46a8c4471_821x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KK-n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47015be5-8c48-48b0-885a-16a46a8c4471_821x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!KK-n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47015be5-8c48-48b0-885a-16a46a8c4471_821x1024.png 848w, https://substackcdn.com/image/fetch/$s_!KK-n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47015be5-8c48-48b0-885a-16a46a8c4471_821x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!KK-n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47015be5-8c48-48b0-885a-16a46a8c4471_821x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KK-n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47015be5-8c48-48b0-885a-16a46a8c4471_821x1024.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/47015be5-8c48-48b0-885a-16a46a8c4471_821x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Real-time AI.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Real-time AI." title="Real-time AI." 
srcset="https://substackcdn.com/image/fetch/$s_!KK-n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47015be5-8c48-48b0-885a-16a46a8c4471_821x1024.png 424w, https://substackcdn.com/image/fetch/$s_!KK-n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47015be5-8c48-48b0-885a-16a46a8c4471_821x1024.png 848w, https://substackcdn.com/image/fetch/$s_!KK-n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47015be5-8c48-48b0-885a-16a46a8c4471_821x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!KK-n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47015be5-8c48-48b0-885a-16a46a8c4471_821x1024.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><p>A nice gesture from the event organizers was a tour of an exhibition of work by Steven Meisel, who has worked a lot with this fashion retailer.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6f58!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5877d30e-53eb-4657-aebf-a133909a48ac_1024x461.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6f58!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5877d30e-53eb-4657-aebf-a133909a48ac_1024x461.png 424w, https://substackcdn.com/image/fetch/$s_!6f58!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5877d30e-53eb-4657-aebf-a133909a48ac_1024x461.png 848w, 
https://substackcdn.com/image/fetch/$s_!6f58!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5877d30e-53eb-4657-aebf-a133909a48ac_1024x461.png 1272w, https://substackcdn.com/image/fetch/$s_!6f58!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5877d30e-53eb-4657-aebf-a133909a48ac_1024x461.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6f58!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5877d30e-53eb-4657-aebf-a133909a48ac_1024x461.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5877d30e-53eb-4657-aebf-a133909a48ac_1024x461.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Real-time AI.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Real-time AI." title="Real-time AI." 
srcset="https://substackcdn.com/image/fetch/$s_!6f58!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5877d30e-53eb-4657-aebf-a133909a48ac_1024x461.png 424w, https://substackcdn.com/image/fetch/$s_!6f58!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5877d30e-53eb-4657-aebf-a133909a48ac_1024x461.png 848w, https://substackcdn.com/image/fetch/$s_!6f58!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5877d30e-53eb-4657-aebf-a133909a48ac_1024x461.png 1272w, https://substackcdn.com/image/fetch/$s_!6f58!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5877d30e-53eb-4657-aebf-a133909a48ac_1024x461.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qjFM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cc555f-3fd9-4e8c-a4d1-14a57468c58b_1024x461.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qjFM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cc555f-3fd9-4e8c-a4d1-14a57468c58b_1024x461.png 424w, https://substackcdn.com/image/fetch/$s_!qjFM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cc555f-3fd9-4e8c-a4d1-14a57468c58b_1024x461.png 848w, 
https://substackcdn.com/image/fetch/$s_!qjFM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cc555f-3fd9-4e8c-a4d1-14a57468c58b_1024x461.png 1272w, https://substackcdn.com/image/fetch/$s_!qjFM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cc555f-3fd9-4e8c-a4d1-14a57468c58b_1024x461.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qjFM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cc555f-3fd9-4e8c-a4d1-14a57468c58b_1024x461.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02cc555f-3fd9-4e8c-a4d1-14a57468c58b_1024x461.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Real-time AI.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Real-time AI." title="Real-time AI." 
srcset="https://substackcdn.com/image/fetch/$s_!qjFM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cc555f-3fd9-4e8c-a4d1-14a57468c58b_1024x461.png 424w, https://substackcdn.com/image/fetch/$s_!qjFM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cc555f-3fd9-4e8c-a4d1-14a57468c58b_1024x461.png 848w, https://substackcdn.com/image/fetch/$s_!qjFM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cc555f-3fd9-4e8c-a4d1-14a57468c58b_1024x461.png 1272w, https://substackcdn.com/image/fetch/$s_!qjFM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02cc555f-3fd9-4e8c-a4d1-14a57468c58b_1024x461.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WF9s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f4460a-4641-4846-8305-5e2fcccf5614_1024x990.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WF9s!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f4460a-4641-4846-8305-5e2fcccf5614_1024x990.png 424w, https://substackcdn.com/image/fetch/$s_!WF9s!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f4460a-4641-4846-8305-5e2fcccf5614_1024x990.png 848w, 
https://substackcdn.com/image/fetch/$s_!WF9s!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f4460a-4641-4846-8305-5e2fcccf5614_1024x990.png 1272w, https://substackcdn.com/image/fetch/$s_!WF9s!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f4460a-4641-4846-8305-5e2fcccf5614_1024x990.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WF9s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f4460a-4641-4846-8305-5e2fcccf5614_1024x990.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2f4460a-4641-4846-8305-5e2fcccf5614_1024x990.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Real-time AI.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Real-time AI." title="Real-time AI." 
srcset="https://substackcdn.com/image/fetch/$s_!WF9s!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f4460a-4641-4846-8305-5e2fcccf5614_1024x990.png 424w, https://substackcdn.com/image/fetch/$s_!WF9s!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f4460a-4641-4846-8305-5e2fcccf5614_1024x990.png 848w, https://substackcdn.com/image/fetch/$s_!WF9s!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f4460a-4641-4846-8305-5e2fcccf5614_1024x990.png 1272w, https://substackcdn.com/image/fetch/$s_!WF9s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2f4460a-4641-4846-8305-5e2fcccf5614_1024x990.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><p>The campus, located in a small Galician village where the founder of the company is from, looked as if Apple had gone on vacation to Sweden and come back wanting to build a summer residence.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AZkq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcb5b1d6-d81a-4afd-9cd4-188b60d747bf_1024x1007.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AZkq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcb5b1d6-d81a-4afd-9cd4-188b60d747bf_1024x1007.png 424w, 
https://substackcdn.com/image/fetch/$s_!AZkq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcb5b1d6-d81a-4afd-9cd4-188b60d747bf_1024x1007.png 848w, https://substackcdn.com/image/fetch/$s_!AZkq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcb5b1d6-d81a-4afd-9cd4-188b60d747bf_1024x1007.png 1272w, https://substackcdn.com/image/fetch/$s_!AZkq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcb5b1d6-d81a-4afd-9cd4-188b60d747bf_1024x1007.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AZkq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcb5b1d6-d81a-4afd-9cd4-188b60d747bf_1024x1007.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dcb5b1d6-d81a-4afd-9cd4-188b60d747bf_1024x1007.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Real-time AI.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Real-time AI." title="Real-time AI." 
srcset="https://substackcdn.com/image/fetch/$s_!AZkq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcb5b1d6-d81a-4afd-9cd4-188b60d747bf_1024x1007.png 424w, https://substackcdn.com/image/fetch/$s_!AZkq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcb5b1d6-d81a-4afd-9cd4-188b60d747bf_1024x1007.png 848w, https://substackcdn.com/image/fetch/$s_!AZkq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcb5b1d6-d81a-4afd-9cd4-188b60d747bf_1024x1007.png 1272w, https://substackcdn.com/image/fetch/$s_!AZkq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcb5b1d6-d81a-4afd-9cd4-188b60d747bf_1024x1007.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I0Kd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d883b85-c886-4c24-84ad-b94206b8f5ee_1024x461.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I0Kd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d883b85-c886-4c24-84ad-b94206b8f5ee_1024x461.png 424w, https://substackcdn.com/image/fetch/$s_!I0Kd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d883b85-c886-4c24-84ad-b94206b8f5ee_1024x461.png 848w, 
https://substackcdn.com/image/fetch/$s_!I0Kd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d883b85-c886-4c24-84ad-b94206b8f5ee_1024x461.png 1272w, https://substackcdn.com/image/fetch/$s_!I0Kd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d883b85-c886-4c24-84ad-b94206b8f5ee_1024x461.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I0Kd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d883b85-c886-4c24-84ad-b94206b8f5ee_1024x461.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2d883b85-c886-4c24-84ad-b94206b8f5ee_1024x461.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Real-time AI.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Real-time AI." title="Real-time AI." 
srcset="https://substackcdn.com/image/fetch/$s_!I0Kd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d883b85-c886-4c24-84ad-b94206b8f5ee_1024x461.png 424w, https://substackcdn.com/image/fetch/$s_!I0Kd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d883b85-c886-4c24-84ad-b94206b8f5ee_1024x461.png 848w, https://substackcdn.com/image/fetch/$s_!I0Kd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d883b85-c886-4c24-84ad-b94206b8f5ee_1024x461.png 1272w, https://substackcdn.com/image/fetch/$s_!I0Kd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d883b85-c886-4c24-84ad-b94206b8f5ee_1024x461.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gZx4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e20c1fd-4dc1-44fd-80ef-ab479a50e7ee_1024x624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gZx4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e20c1fd-4dc1-44fd-80ef-ab479a50e7ee_1024x624.png 424w, https://substackcdn.com/image/fetch/$s_!gZx4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e20c1fd-4dc1-44fd-80ef-ab479a50e7ee_1024x624.png 848w, 
https://substackcdn.com/image/fetch/$s_!gZx4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e20c1fd-4dc1-44fd-80ef-ab479a50e7ee_1024x624.png 1272w, https://substackcdn.com/image/fetch/$s_!gZx4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e20c1fd-4dc1-44fd-80ef-ab479a50e7ee_1024x624.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gZx4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e20c1fd-4dc1-44fd-80ef-ab479a50e7ee_1024x624.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e20c1fd-4dc1-44fd-80ef-ab479a50e7ee_1024x624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Real-time AI.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Real-time AI." title="Real-time AI." 
srcset="https://substackcdn.com/image/fetch/$s_!gZx4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e20c1fd-4dc1-44fd-80ef-ab479a50e7ee_1024x624.png 424w, https://substackcdn.com/image/fetch/$s_!gZx4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e20c1fd-4dc1-44fd-80ef-ab479a50e7ee_1024x624.png 848w, https://substackcdn.com/image/fetch/$s_!gZx4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e20c1fd-4dc1-44fd-80ef-ab479a50e7ee_1024x624.png 1272w, https://substackcdn.com/image/fetch/$s_!gZx4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e20c1fd-4dc1-44fd-80ef-ab479a50e7ee_1024x624.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>But now back to the talk.</p><p>A few years back I began thinking about a topic that was just emerging at the time.&nbsp;</p><p>I started seeing more and more companies wanting to do more with their billions of events that they collected and stored. They wanted to react to them in real-time or near real-time. 
And they wanted to react with almost human intelligence, instead of just counting or aggregating or storing and forgetting.</p><p>This led me to naming the three driving forces in the market at the time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Doby!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eafa18a-5157-4ce3-bcd8-56b803dfc607_1024x421.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Doby!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eafa18a-5157-4ce3-bcd8-56b803dfc607_1024x421.png 424w, https://substackcdn.com/image/fetch/$s_!Doby!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eafa18a-5157-4ce3-bcd8-56b803dfc607_1024x421.png 848w, https://substackcdn.com/image/fetch/$s_!Doby!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eafa18a-5157-4ce3-bcd8-56b803dfc607_1024x421.png 1272w, https://substackcdn.com/image/fetch/$s_!Doby!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eafa18a-5157-4ce3-bcd8-56b803dfc607_1024x421.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Doby!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eafa18a-5157-4ce3-bcd8-56b803dfc607_1024x421.png" width="1024" height="421" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3eafa18a-5157-4ce3-bcd8-56b803dfc607_1024x421.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:421,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Time Value of Data&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Time Value of Data" title="Time Value of Data" srcset="https://substackcdn.com/image/fetch/$s_!Doby!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eafa18a-5157-4ce3-bcd8-56b803dfc607_1024x421.png 424w, https://substackcdn.com/image/fetch/$s_!Doby!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eafa18a-5157-4ce3-bcd8-56b803dfc607_1024x421.png 848w, https://substackcdn.com/image/fetch/$s_!Doby!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eafa18a-5157-4ce3-bcd8-56b803dfc607_1024x421.png 1272w, https://substackcdn.com/image/fetch/$s_!Doby!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eafa18a-5157-4ce3-bcd8-56b803dfc607_1024x421.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h5><strong>Very large # of events</strong></h5><p>Global user base creates billions upon billions of interactions with systems</p><h5><strong>Real-time</strong></h5><p>Systems must serve intelligence in or near real-time, or opportunities are lost</p><h5><strong>Near-human intelligence</strong></h5><p>Systems must be able to apply near-human intelligence to super-human volumes of data</p><p>Why are businesses interested in reacting in real-time? 
This is best visualized by this picture.</p><h5><strong>Time Value of Data</strong></h5><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M35r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0e04c4-f660-4cf3-aca3-2916a78c9260_1024x530.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M35r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0e04c4-f660-4cf3-aca3-2916a78c9260_1024x530.png 424w, https://substackcdn.com/image/fetch/$s_!M35r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0e04c4-f660-4cf3-aca3-2916a78c9260_1024x530.png 848w, https://substackcdn.com/image/fetch/$s_!M35r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0e04c4-f660-4cf3-aca3-2916a78c9260_1024x530.png 1272w, https://substackcdn.com/image/fetch/$s_!M35r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0e04c4-f660-4cf3-aca3-2916a78c9260_1024x530.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!M35r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0e04c4-f660-4cf3-aca3-2916a78c9260_1024x530.png" width="1024" height="530" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd0e04c4-f660-4cf3-aca3-2916a78c9260_1024x530.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:530,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Time Value of Data&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Time Value of Data" title="Time Value of Data" srcset="https://substackcdn.com/image/fetch/$s_!M35r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0e04c4-f660-4cf3-aca3-2916a78c9260_1024x530.png 424w, https://substackcdn.com/image/fetch/$s_!M35r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0e04c4-f660-4cf3-aca3-2916a78c9260_1024x530.png 848w, https://substackcdn.com/image/fetch/$s_!M35r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0e04c4-f660-4cf3-aca3-2916a78c9260_1024x530.png 1272w, https://substackcdn.com/image/fetch/$s_!M35r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0e04c4-f660-4cf3-aca3-2916a78c9260_1024x530.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Imagine this mountain range of business value. The rightmost peak is the Mountain of Wisdom. These are our traditional Data Lakes or Data Warehouses, which hold events created over a long period of time. You can run analytical queries on them. You can train your ML models on them. You have years and years of data there. Businesses get a lot of value from that.</p><p>But in addition to that mountain of data, there is also a lot of value in reacting to an event in the same instant it happens. The classic example is fraud detection and the cancellation of a transaction. This is the Summit of Now.</p><p>Reacting to an event instantaneously is not always possible, but that&#8217;s ok, because there are lots of use cases where you have between several seconds and several minutes to do something smart. 
This is the Peak of Soon After.</p><p>And this brings us to the definition of Real-time AI, and the techniques that implement it.</p><h5><strong>Near-human Intelligence</strong></h5><p>Let&#8217;s start with the easy part. It is clear that to address the need for near-human intelligence one needs to use Machine Learning. Dealing with billions of events manually, with heuristics, just would not scale.</p><h5><strong>Real-time</strong></h5><p>On the time dimension we have two different approaches.&nbsp;</p><p><strong>True Real-time:</strong> In cases when we need true Real-time, when intelligent responses need to happen in a second or less, Large-Scale Event Processing is best. Building pipelines is the wrong approach, because pipelines tend to micro-batch (yes, everyone is doing it, they just call it something else) and introduce latency. I happen to live in a country where Amazon Store cards are a bit stingy with credit. My limit is low by American standards and Amazon periodically cancels my orders when I exceed my store card limit, but they do it 1, 3, or 5 minutes after they accept my order. Why they can&#8217;t check before accepting the order beats me. I sometimes imagine a Kinesis Data Firehose piping all these orders into some file, and a cron job periodically waking up and doing a micro-batch call to their Store Card bank to check the balance of all their recent customers. Not great.
For best customer experience you want to be able to do inference on very small batches (if batching them at all) in the Cloud or even on the Edge, if you are in a brick-and-mortar retail or IoT situation.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cymo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e4c313c-aa90-4670-b61d-44b8a8a83915_685x519.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cymo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e4c313c-aa90-4670-b61d-44b8a8a83915_685x519.png 424w, https://substackcdn.com/image/fetch/$s_!cymo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e4c313c-aa90-4670-b61d-44b8a8a83915_685x519.png 848w, https://substackcdn.com/image/fetch/$s_!cymo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e4c313c-aa90-4670-b61d-44b8a8a83915_685x519.png 1272w, https://substackcdn.com/image/fetch/$s_!cymo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e4c313c-aa90-4670-b61d-44b8a8a83915_685x519.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cymo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e4c313c-aa90-4670-b61d-44b8a8a83915_685x519.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e4c313c-aa90-4670-b61d-44b8a8a83915_685x519.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Time Value of Data&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Time Value of Data" title="Time Value of Data" srcset="https://substackcdn.com/image/fetch/$s_!cymo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e4c313c-aa90-4670-b61d-44b8a8a83915_685x519.png 424w, https://substackcdn.com/image/fetch/$s_!cymo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e4c313c-aa90-4670-b61d-44b8a8a83915_685x519.png 848w, https://substackcdn.com/image/fetch/$s_!cymo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e4c313c-aa90-4670-b61d-44b8a8a83915_685x519.png 1272w, https://substackcdn.com/image/fetch/$s_!cymo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e4c313c-aa90-4670-b61d-44b8a8a83915_685x519.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Near Real-time:</strong> However, in a large class of use cases users have between several seconds and several minutes to act. For example, I know at least one Security Alarm vendor that does not react to every signal from a motion or glass-breaking sensor, or to a video camera reporting movement.
They wait multiple seconds, collecting all the signals from various rooms of the house, and then make an inference about whether an incident is a real case of breaking and entering or a false positive. There is a real cost to their business if they react to every false positive, and driving down the rate of false alarms keeps their costs in line. In cases like this, one can start using techniques customary to the domain of Streaming Analytics (time windows and such). One can bring in different event data sources, and one can even query historical data (or use it as a predictor). Using streaming pipelines or dumping raw events into Data Lakes and then doing aggregations and inference are all fine techniques here. If you feed these events into an analytical store, you even get an early start on forming your next peak in the data value mountain range &#8211; the Mountain of Wisdom!</p><p>Snowflake is moving fast to enable users to implement these Real-time and Near Real-time AI use cases. For Streaming Analytics, Snowflake now offers in Public Preview <a href="https://docs.snowflake.com/en/user-guide/data-load-snowpipe-streaming-overview">Snowpipe Streaming</a>, a streaming ingestion solution that has a high insert rate (GB/sec per table) and low latency (data is queryable in seconds, &lt;5s being typical). We also launched in Public Preview something we call Dynamic Tables &#8211; a very simple way to define streaming pipelines entirely in SQL by specifying one of the most important streaming Service Level Objectives (SLOs) &#8211; the Lag (here is the Private Preview <a href="https://www.snowflake.com/blog/dynamic-tables-delivering-declarative-streaming-data-pipelines/">announcement</a>).
And for Machine Learning we offer compute capabilities in <a href="https://docs.snowflake.com/en/developer-guide/snowpark/index">Snowpark</a> and are working on much more (come to our <a href="https://www.snowflake.com/summit/">Summit</a> or listen to announcements online).</p><p>Happy to see that the industry is also taking this trend seriously. Perhaps you should too.</p>]]></content:encoded></item><item><title><![CDATA[Graphext, data insights for non-data scientists]]></title><description><![CDATA[I am playing with Graphext &#8211; it&#8217;s like Trifacta, but with more powerful data science functionality.]]></description><link>https://www.datancoff.ee/p/graphext-data-insights-for-non-data-scientists</link><guid isPermaLink="false">https://www.datancoff.ee/p/graphext-data-insights-for-non-data-scientists</guid><dc:creator><![CDATA[Serhii Sokolenko]]></dc:creator><pubDate>Mon, 28 Nov 2022 10:05:28 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f01476c3-8899-4b4d-8cc5-ef97be1a5d0a_1024x890.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I am playing with <a href="http://graphext.com">Graphext</a> &#8211; it&#8217;s like Trifacta, but with more powerful data science functionality. If you are a product manager or finance person and need to understand why revenue is down/up, what patterns are in your dataset, etc., this tool is pretty nifty.</p><p>I first loaded the Titanic survivors dataset that I <a href="https://www.kaggle.com/competitions/titanic/">downloaded</a> from Kaggle. It has about 900 passengers in the training dataset and a few hundred in the test one, with attributes like age, ticket class, sex, and number of relatives on board, plus the most important variable &#8211; whether the passenger survived or not.</p><p>Here is a cluster analysis of the dataset, with about 20 clusters identified. Each dot is a passenger, and they are clustered using the built-in CatBoost gradient boosted decision trees classifier.
The predicted variable is &#8220;Survived&#8221; (yes or no).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bPeY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6982af8d-044e-43ea-b2a1-a72663035205_1024x890.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bPeY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6982af8d-044e-43ea-b2a1-a72663035205_1024x890.png 424w, https://substackcdn.com/image/fetch/$s_!bPeY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6982af8d-044e-43ea-b2a1-a72663035205_1024x890.png 848w, https://substackcdn.com/image/fetch/$s_!bPeY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6982af8d-044e-43ea-b2a1-a72663035205_1024x890.png 1272w, https://substackcdn.com/image/fetch/$s_!bPeY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6982af8d-044e-43ea-b2a1-a72663035205_1024x890.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bPeY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6982af8d-044e-43ea-b2a1-a72663035205_1024x890.png" width="1024" height="890" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6982af8d-044e-43ea-b2a1-a72663035205_1024x890.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:890,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;titanic survivor cluster analysis&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="titanic survivor cluster analysis" title="titanic survivor cluster analysis" srcset="https://substackcdn.com/image/fetch/$s_!bPeY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6982af8d-044e-43ea-b2a1-a72663035205_1024x890.png 424w, https://substackcdn.com/image/fetch/$s_!bPeY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6982af8d-044e-43ea-b2a1-a72663035205_1024x890.png 848w, https://substackcdn.com/image/fetch/$s_!bPeY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6982af8d-044e-43ea-b2a1-a72663035205_1024x890.png 1272w, https://substackcdn.com/image/fetch/$s_!bPeY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6982af8d-044e-43ea-b2a1-a72663035205_1024x890.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>At the start of my evaluation I wanted to see if it was possible to get a feeling for the data without writing any pandas code or using Jupyter, and then set up a predictive model for Survived using UI tools only. The tool largely worked for me, and after ~1.5 hours I felt like I had discovered some non-obvious patterns in the dataset. For example, &#8220;Male passengers in the 3rd class and no siblings&#8221; were unfortunately not very lucky that day.
Take a look at Cluster 10 and the side bars.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hwDw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33206037-9386-450f-b3bb-31884d9ec351_982x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hwDw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33206037-9386-450f-b3bb-31884d9ec351_982x1024.png 424w, https://substackcdn.com/image/fetch/$s_!hwDw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33206037-9386-450f-b3bb-31884d9ec351_982x1024.png 848w, https://substackcdn.com/image/fetch/$s_!hwDw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33206037-9386-450f-b3bb-31884d9ec351_982x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!hwDw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33206037-9386-450f-b3bb-31884d9ec351_982x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hwDw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33206037-9386-450f-b3bb-31884d9ec351_982x1024.png" width="982" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33206037-9386-450f-b3bb-31884d9ec351_982x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:982,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;titanic survivor cluster 10&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="titanic survivor cluster 10" title="titanic survivor cluster 10" srcset="https://substackcdn.com/image/fetch/$s_!hwDw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33206037-9386-450f-b3bb-31884d9ec351_982x1024.png 424w, https://substackcdn.com/image/fetch/$s_!hwDw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33206037-9386-450f-b3bb-31884d9ec351_982x1024.png 848w, https://substackcdn.com/image/fetch/$s_!hwDw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33206037-9386-450f-b3bb-31884d9ec351_982x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!hwDw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33206037-9386-450f-b3bb-31884d9ec351_982x1024.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>Or take, for example, this Cluster 3, &#8220;Female passengers in 2nd and 3rd class&#8221;, 80% of whom were able to survive. Note how the UI helps tell the story. In the upper left corner we have the &#8220;significant variables&#8221; telling us that passenger class and sex are significant variables in this cluster. In the lower left we also have the Error bars and the Confusion Matrix. Then, on the right, we have a comparison of the predicted variable and the actual prediction (upper right).
Below we see other variables.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-bb-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5110400d-043e-4595-873b-1c8ef92623ba_922x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-bb-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5110400d-043e-4595-873b-1c8ef92623ba_922x1024.png 424w, https://substackcdn.com/image/fetch/$s_!-bb-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5110400d-043e-4595-873b-1c8ef92623ba_922x1024.png 848w, https://substackcdn.com/image/fetch/$s_!-bb-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5110400d-043e-4595-873b-1c8ef92623ba_922x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!-bb-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5110400d-043e-4595-873b-1c8ef92623ba_922x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-bb-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5110400d-043e-4595-873b-1c8ef92623ba_922x1024.png" width="922" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5110400d-043e-4595-873b-1c8ef92623ba_922x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:922,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;titanic survivor cluster 3&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="titanic survivor cluster 3" title="titanic survivor cluster 3" srcset="https://substackcdn.com/image/fetch/$s_!-bb-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5110400d-043e-4595-873b-1c8ef92623ba_922x1024.png 424w, https://substackcdn.com/image/fetch/$s_!-bb-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5110400d-043e-4595-873b-1c8ef92623ba_922x1024.png 848w, https://substackcdn.com/image/fetch/$s_!-bb-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5110400d-043e-4595-873b-1c8ef92623ba_922x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!-bb-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5110400d-043e-4595-873b-1c8ef92623ba_922x1024.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>Before building the predictive model, though, I played with variable distributions and correlations.<br>&#8211; Passenger class was pretty important for survival (3rd class was a bit lethal),<br>&#8211; sex was correlated too, and<br>&#8211; age was also important (see that big white circle for the age bucket 20-30?).<br>Btw, the tool did the bucketing for the continuous age variable automatically.<br>Other insights:<br>&#8211; if you had 0 siblings/spouses or parents/children (SibSp and Parch), your survival was questionable</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IFUo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced8f094-e2c6-417f-a31a-16134e94a706_1024x713.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!IFUo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced8f094-e2c6-417f-a31a-16134e94a706_1024x713.png 424w, https://substackcdn.com/image/fetch/$s_!IFUo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced8f094-e2c6-417f-a31a-16134e94a706_1024x713.png 848w, https://substackcdn.com/image/fetch/$s_!IFUo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced8f094-e2c6-417f-a31a-16134e94a706_1024x713.png 1272w, https://substackcdn.com/image/fetch/$s_!IFUo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced8f094-e2c6-417f-a31a-16134e94a706_1024x713.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IFUo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced8f094-e2c6-417f-a31a-16134e94a706_1024x713.png" width="1024" height="713" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ced8f094-e2c6-417f-a31a-16134e94a706_1024x713.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:713,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;titanic survivor correlations of variables&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="titanic survivor correlations of variables" title="titanic survivor correlations of variables" 
srcset="https://substackcdn.com/image/fetch/$s_!IFUo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced8f094-e2c6-417f-a31a-16134e94a706_1024x713.png 424w, https://substackcdn.com/image/fetch/$s_!IFUo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced8f094-e2c6-417f-a31a-16134e94a706_1024x713.png 848w, https://substackcdn.com/image/fetch/$s_!IFUo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced8f094-e2c6-417f-a31a-16134e94a706_1024x713.png 1272w, https://substackcdn.com/image/fetch/$s_!IFUo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fced8f094-e2c6-417f-a31a-16134e94a706_1024x713.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Initially, I started by just exploring the dataset. The &#8220;Data&#8221; tab is what reminded me of other data wrangling tools. In the middle of the screen you see your dataset table, and on the right you see distributions of your columns. You can manually set the types of columns to boolean and categorical, which helps in predictive models. The distribution charts on the right allow you to select a range of values, and this auto-adjusts other columns (this reminded me of Snowflake).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!07XF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe29db8-2829-493d-af92-40b858746f87_1024x874.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!07XF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe29db8-2829-493d-af92-40b858746f87_1024x874.png 424w, https://substackcdn.com/image/fetch/$s_!07XF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe29db8-2829-493d-af92-40b858746f87_1024x874.png 848w, https://substackcdn.com/image/fetch/$s_!07XF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe29db8-2829-493d-af92-40b858746f87_1024x874.png 1272w, 
https://substackcdn.com/image/fetch/$s_!07XF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe29db8-2829-493d-af92-40b858746f87_1024x874.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!07XF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe29db8-2829-493d-af92-40b858746f87_1024x874.png" width="1024" height="874" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6fe29db8-2829-493d-af92-40b858746f87_1024x874.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:874,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;graphext UI&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="graphext UI" title="graphext UI" srcset="https://substackcdn.com/image/fetch/$s_!07XF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe29db8-2829-493d-af92-40b858746f87_1024x874.png 424w, https://substackcdn.com/image/fetch/$s_!07XF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe29db8-2829-493d-af92-40b858746f87_1024x874.png 848w, https://substackcdn.com/image/fetch/$s_!07XF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe29db8-2829-493d-af92-40b858746f87_1024x874.png 1272w, 
https://substackcdn.com/image/fetch/$s_!07XF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe29db8-2829-493d-af92-40b858746f87_1024x874.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>It took me a bit to find how to add derived variables. 
Turns out one needs to select the &#8220;Wizard&#8221; button.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a9gd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62643054-dded-4dab-9cd8-fbded7495e38_224x64.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a9gd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62643054-dded-4dab-9cd8-fbded7495e38_224x64.png 424w, https://substackcdn.com/image/fetch/$s_!a9gd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62643054-dded-4dab-9cd8-fbded7495e38_224x64.png 848w, https://substackcdn.com/image/fetch/$s_!a9gd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62643054-dded-4dab-9cd8-fbded7495e38_224x64.png 1272w, https://substackcdn.com/image/fetch/$s_!a9gd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62643054-dded-4dab-9cd8-fbded7495e38_224x64.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a9gd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62643054-dded-4dab-9cd8-fbded7495e38_224x64.png" width="224" height="64" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62643054-dded-4dab-9cd8-fbded7495e38_224x64.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:64,&quot;width&quot;:224,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!a9gd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62643054-dded-4dab-9cd8-fbded7495e38_224x64.png 424w, https://substackcdn.com/image/fetch/$s_!a9gd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62643054-dded-4dab-9cd8-fbded7495e38_224x64.png 848w, https://substackcdn.com/image/fetch/$s_!a9gd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62643054-dded-4dab-9cd8-fbded7495e38_224x64.png 1272w, https://substackcdn.com/image/fetch/$s_!a9gd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62643054-dded-4dab-9cd8-fbded7495e38_224x64.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><br>It then opens a panel with two enabled choices, &#8220;Social media&#8221; and &#8220;Others&#8221;; under &#8220;Others&#8221; you find &#8220;Enrichments&#8221;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!uol6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240342d9-dde1-493c-97b8-2c4308fa35c8_510x777.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uol6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240342d9-dde1-493c-97b8-2c4308fa35c8_510x777.png 424w, https://substackcdn.com/image/fetch/$s_!uol6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240342d9-dde1-493c-97b8-2c4308fa35c8_510x777.png 848w, https://substackcdn.com/image/fetch/$s_!uol6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240342d9-dde1-493c-97b8-2c4308fa35c8_510x777.png 1272w, https://substackcdn.com/image/fetch/$s_!uol6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240342d9-dde1-493c-97b8-2c4308fa35c8_510x777.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uol6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240342d9-dde1-493c-97b8-2c4308fa35c8_510x777.png" width="510" height="777" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/240342d9-dde1-493c-97b8-2c4308fa35c8_510x777.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:777,&quot;width&quot;:510,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!uol6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240342d9-dde1-493c-97b8-2c4308fa35c8_510x777.png 424w, https://substackcdn.com/image/fetch/$s_!uol6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240342d9-dde1-493c-97b8-2c4308fa35c8_510x777.png 848w, https://substackcdn.com/image/fetch/$s_!uol6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240342d9-dde1-493c-97b8-2c4308fa35c8_510x777.png 1272w, https://substackcdn.com/image/fetch/$s_!uol6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240342d9-dde1-493c-97b8-2c4308fa35c8_510x777.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>The available enrichment types include sentiment analysis, entity extraction, fuzzy grouping, geolocation, and some external methods that require an API key. 
Btw, the origins of Graphext were in text analytics, hence all these text analysis options (me likey).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yQDx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff45d8687-8957-4fb2-a375-0d4d342e096d_512x452.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yQDx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff45d8687-8957-4fb2-a375-0d4d342e096d_512x452.png 424w, https://substackcdn.com/image/fetch/$s_!yQDx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff45d8687-8957-4fb2-a375-0d4d342e096d_512x452.png 848w, https://substackcdn.com/image/fetch/$s_!yQDx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff45d8687-8957-4fb2-a375-0d4d342e096d_512x452.png 1272w, https://substackcdn.com/image/fetch/$s_!yQDx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff45d8687-8957-4fb2-a375-0d4d342e096d_512x452.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yQDx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff45d8687-8957-4fb2-a375-0d4d342e096d_512x452.png" width="512" height="452" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f45d8687-8957-4fb2-a375-0d4d342e096d_512x452.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:452,&quot;width&quot;:512,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!yQDx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff45d8687-8957-4fb2-a375-0d4d342e096d_512x452.png 424w, https://substackcdn.com/image/fetch/$s_!yQDx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff45d8687-8957-4fb2-a375-0d4d342e096d_512x452.png 848w, https://substackcdn.com/image/fetch/$s_!yQDx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff45d8687-8957-4fb2-a375-0d4d342e096d_512x452.png 1272w, https://substackcdn.com/image/fetch/$s_!yQDx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff45d8687-8957-4fb2-a375-0d4d342e096d_512x452.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Still, I could not find an intuitive way to add a new variable based on just some data transformations. Hopefully that can be added soon.<br>Interestingly, while the tool positions itself as a &#8220;no-code tool for data scientists&#8221;, it does allow you to review the code for the &#8220;recipes&#8221; (hello Trifacta) that it creates as you transform columns and define models. 
As such, I see Graphext as a &#8220;tool for non-data scientists who want to do data science (but can&#8217;t remember the pandas API)&#8221;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SQI6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc1d053b-81c4-46f4-a29f-8e07ce118f3e_811x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SQI6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc1d053b-81c4-46f4-a29f-8e07ce118f3e_811x1024.png 424w, https://substackcdn.com/image/fetch/$s_!SQI6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc1d053b-81c4-46f4-a29f-8e07ce118f3e_811x1024.png 848w, https://substackcdn.com/image/fetch/$s_!SQI6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc1d053b-81c4-46f4-a29f-8e07ce118f3e_811x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!SQI6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc1d053b-81c4-46f4-a29f-8e07ce118f3e_811x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SQI6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc1d053b-81c4-46f4-a29f-8e07ce118f3e_811x1024.png" width="811" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc1d053b-81c4-46f4-a29f-8e07ce118f3e_811x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:811,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!SQI6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc1d053b-81c4-46f4-a29f-8e07ce118f3e_811x1024.png 424w, https://substackcdn.com/image/fetch/$s_!SQI6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc1d053b-81c4-46f4-a29f-8e07ce118f3e_811x1024.png 848w, https://substackcdn.com/image/fetch/$s_!SQI6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc1d053b-81c4-46f4-a29f-8e07ce118f3e_811x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!SQI6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc1d053b-81c4-46f4-a29f-8e07ce118f3e_811x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>IMO, the 3 biggest value adds of Graphext are:<br>&#8211; UI-guided creation of prediction models<br>&#8211; graph visualizer for the identified clusters<br>&#8211; built-in reporting tool called &#8220;Insights&#8221;</p><p>The tool is still a little buggy, and I could not create a model directly on the Models tab: the &#8220;Create model&#8221; button was disabled. I found a way to create a model by going to the &#8220;Graph&#8221; tab. 
If you don&#8217;t see the Models and Graph tabs, hover over the tab bar and click on an icon that looks like an arrow.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RE6D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2d6480-7072-479f-81c6-b2cf293f36ce_495x65.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RE6D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2d6480-7072-479f-81c6-b2cf293f36ce_495x65.png 424w, https://substackcdn.com/image/fetch/$s_!RE6D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2d6480-7072-479f-81c6-b2cf293f36ce_495x65.png 848w, https://substackcdn.com/image/fetch/$s_!RE6D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2d6480-7072-479f-81c6-b2cf293f36ce_495x65.png 1272w, https://substackcdn.com/image/fetch/$s_!RE6D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2d6480-7072-479f-81c6-b2cf293f36ce_495x65.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RE6D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2d6480-7072-479f-81c6-b2cf293f36ce_495x65.png" width="495" height="65" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed2d6480-7072-479f-81c6-b2cf293f36ce_495x65.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:65,&quot;width&quot;:495,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!RE6D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2d6480-7072-479f-81c6-b2cf293f36ce_495x65.png 424w, https://substackcdn.com/image/fetch/$s_!RE6D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2d6480-7072-479f-81c6-b2cf293f36ce_495x65.png 848w, https://substackcdn.com/image/fetch/$s_!RE6D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2d6480-7072-479f-81c6-b2cf293f36ce_495x65.png 1272w, https://substackcdn.com/image/fetch/$s_!RE6D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2d6480-7072-479f-81c6-b2cf293f36ce_495x65.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>On the Graph tab, click on the &#8220;Create Network&#8221; button.</p><p>A selection tab will open (remember it from the data &#8220;enrichment&#8221; step?), with many previously disabled options now being available.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!amtS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3ac177-cad7-4ea0-bc69-17904753ecc8_517x823.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!amtS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3ac177-cad7-4ea0-bc69-17904753ecc8_517x823.png 424w, https://substackcdn.com/image/fetch/$s_!amtS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3ac177-cad7-4ea0-bc69-17904753ecc8_517x823.png 848w, https://substackcdn.com/image/fetch/$s_!amtS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3ac177-cad7-4ea0-bc69-17904753ecc8_517x823.png 1272w, https://substackcdn.com/image/fetch/$s_!amtS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3ac177-cad7-4ea0-bc69-17904753ecc8_517x823.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!amtS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3ac177-cad7-4ea0-bc69-17904753ecc8_517x823.png" width="517" height="823" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd3ac177-cad7-4ea0-bc69-17904753ecc8_517x823.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:823,&quot;width&quot;:517,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!amtS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3ac177-cad7-4ea0-bc69-17904753ecc8_517x823.png 424w, https://substackcdn.com/image/fetch/$s_!amtS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3ac177-cad7-4ea0-bc69-17904753ecc8_517x823.png 848w, https://substackcdn.com/image/fetch/$s_!amtS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3ac177-cad7-4ea0-bc69-17904753ecc8_517x823.png 1272w, https://substackcdn.com/image/fetch/$s_!amtS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3ac177-cad7-4ea0-bc69-17904753ecc8_517x823.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>I recommend exploring all of them, because I liked what I saw there. There were training and prediction tasks, text analytics tasks (sentiment and entity extraction), commerce tasks (e.g. 
product basket analysis)&#8230;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5ugw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72078440-2134-4acc-a3f6-78222172cfbe_502x386.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5ugw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72078440-2134-4acc-a3f6-78222172cfbe_502x386.png 424w, https://substackcdn.com/image/fetch/$s_!5ugw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72078440-2134-4acc-a3f6-78222172cfbe_502x386.png 848w, https://substackcdn.com/image/fetch/$s_!5ugw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72078440-2134-4acc-a3f6-78222172cfbe_502x386.png 1272w, https://substackcdn.com/image/fetch/$s_!5ugw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72078440-2134-4acc-a3f6-78222172cfbe_502x386.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5ugw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72078440-2134-4acc-a3f6-78222172cfbe_502x386.png" width="502" height="386" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72078440-2134-4acc-a3f6-78222172cfbe_502x386.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:386,&quot;width&quot;:502,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!5ugw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72078440-2134-4acc-a3f6-78222172cfbe_502x386.png 424w, https://substackcdn.com/image/fetch/$s_!5ugw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72078440-2134-4acc-a3f6-78222172cfbe_502x386.png 848w, https://substackcdn.com/image/fetch/$s_!5ugw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72078440-2134-4acc-a3f6-78222172cfbe_502x386.png 1272w, https://substackcdn.com/image/fetch/$s_!5ugw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72078440-2134-4acc-a3f6-78222172cfbe_502x386.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d4yx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7399fbc8-d7d4-484d-9c3d-4787f814e395_502x529.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d4yx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7399fbc8-d7d4-484d-9c3d-4787f814e395_502x529.png 424w, https://substackcdn.com/image/fetch/$s_!d4yx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7399fbc8-d7d4-484d-9c3d-4787f814e395_502x529.png 848w, 
https://substackcdn.com/image/fetch/$s_!d4yx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7399fbc8-d7d4-484d-9c3d-4787f814e395_502x529.png 1272w, https://substackcdn.com/image/fetch/$s_!d4yx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7399fbc8-d7d4-484d-9c3d-4787f814e395_502x529.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d4yx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7399fbc8-d7d4-484d-9c3d-4787f814e395_502x529.png" width="502" height="529" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7399fbc8-d7d4-484d-9c3d-4787f814e395_502x529.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:529,&quot;width&quot;:502,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!d4yx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7399fbc8-d7d4-484d-9c3d-4787f814e395_502x529.png 424w, https://substackcdn.com/image/fetch/$s_!d4yx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7399fbc8-d7d4-484d-9c3d-4787f814e395_502x529.png 848w, 
https://substackcdn.com/image/fetch/$s_!d4yx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7399fbc8-d7d4-484d-9c3d-4787f814e395_502x529.png 1272w, https://substackcdn.com/image/fetch/$s_!d4yx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7399fbc8-d7d4-484d-9c3d-4787f814e395_502x529.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!yKma!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e0829a-6ac1-4632-bfd5-290d63e2656a_502x376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yKma!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e0829a-6ac1-4632-bfd5-290d63e2656a_502x376.png 424w, https://substackcdn.com/image/fetch/$s_!yKma!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e0829a-6ac1-4632-bfd5-290d63e2656a_502x376.png 848w, https://substackcdn.com/image/fetch/$s_!yKma!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e0829a-6ac1-4632-bfd5-290d63e2656a_502x376.png 1272w, https://substackcdn.com/image/fetch/$s_!yKma!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e0829a-6ac1-4632-bfd5-290d63e2656a_502x376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yKma!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e0829a-6ac1-4632-bfd5-290d63e2656a_502x376.png" width="502" height="376" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e2e0829a-6ac1-4632-bfd5-290d63e2656a_502x376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:376,&quot;width&quot;:502,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!yKma!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e0829a-6ac1-4632-bfd5-290d63e2656a_502x376.png 424w, https://substackcdn.com/image/fetch/$s_!yKma!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e0829a-6ac1-4632-bfd5-290d63e2656a_502x376.png 848w, https://substackcdn.com/image/fetch/$s_!yKma!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e0829a-6ac1-4632-bfd5-290d63e2656a_502x376.png 1272w, https://substackcdn.com/image/fetch/$s_!yKma!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2e0829a-6ac1-4632-bfd5-290d63e2656a_502x376.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Btw, this is where the user experience is still a little rough.</p><p>I wanted to create a few screenshots of how to create a clustering-based prediction model. But for the life of me I could not remember what option I chose initially on the selection panel. Was it the &#8220;Source and target network&#8221;? &#8220;Train and predict&#8221;? I rolled the dice and chose &#8220;Source and target network&#8221;, and of course ended up wasting time. 
I could not even roll back the resulting &#8220;recipe&#8221; and had to reimport the data.</p><p>I tried &#8220;Train and predict&#8221; and finally found what I was looking for.<br>Interestingly, the tool only offered two types of models: CatBoost and Logistic Regression (was it because my predicted variable was Categorical?).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dTTv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bf37109-dd9e-4b4c-b746-97ea68a1aa75_489x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dTTv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bf37109-dd9e-4b4c-b746-97ea68a1aa75_489x1024.png 424w, https://substackcdn.com/image/fetch/$s_!dTTv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bf37109-dd9e-4b4c-b746-97ea68a1aa75_489x1024.png 848w, https://substackcdn.com/image/fetch/$s_!dTTv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bf37109-dd9e-4b4c-b746-97ea68a1aa75_489x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!dTTv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bf37109-dd9e-4b4c-b746-97ea68a1aa75_489x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dTTv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bf37109-dd9e-4b4c-b746-97ea68a1aa75_489x1024.png" width="489" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1bf37109-dd9e-4b4c-b746-97ea68a1aa75_489x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:489,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!dTTv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bf37109-dd9e-4b4c-b746-97ea68a1aa75_489x1024.png 424w, https://substackcdn.com/image/fetch/$s_!dTTv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bf37109-dd9e-4b4c-b746-97ea68a1aa75_489x1024.png 848w, https://substackcdn.com/image/fetch/$s_!dTTv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bf37109-dd9e-4b4c-b746-97ea68a1aa75_489x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!dTTv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bf37109-dd9e-4b4c-b746-97ea68a1aa75_489x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><br>One of the unique (to my knowledge) features of Graphext is the ability to visualize the prediction model as a Graph of Clusters, a Geo Map, or a Treemap. 
The Graph visualization ended up being very effective in interpreting the results (model explainability).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vbp2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e22b9b-4269-49f5-9a6b-cf7384f97124_519x509.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vbp2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e22b9b-4269-49f5-9a6b-cf7384f97124_519x509.png 424w, https://substackcdn.com/image/fetch/$s_!vbp2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e22b9b-4269-49f5-9a6b-cf7384f97124_519x509.png 848w, https://substackcdn.com/image/fetch/$s_!vbp2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e22b9b-4269-49f5-9a6b-cf7384f97124_519x509.png 1272w, https://substackcdn.com/image/fetch/$s_!vbp2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e22b9b-4269-49f5-9a6b-cf7384f97124_519x509.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vbp2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e22b9b-4269-49f5-9a6b-cf7384f97124_519x509.png" width="519" height="509" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e4e22b9b-4269-49f5-9a6b-cf7384f97124_519x509.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:509,&quot;width&quot;:519,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!vbp2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e22b9b-4269-49f5-9a6b-cf7384f97124_519x509.png 424w, https://substackcdn.com/image/fetch/$s_!vbp2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e22b9b-4269-49f5-9a6b-cf7384f97124_519x509.png 848w, https://substackcdn.com/image/fetch/$s_!vbp2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e22b9b-4269-49f5-9a6b-cf7384f97124_519x509.png 1272w, https://substackcdn.com/image/fetch/$s_!vbp2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4e22b9b-4269-49f5-9a6b-cf7384f97124_519x509.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Training the model took about 5 minutes. Graphext has an interesting architecture. They do the &#8220;fast tasks&#8221; (rendering graphs, data transformations, etc.) in the browser and the &#8220;slow tasks&#8221; (like model training) in the cloud. The browser takes on quite a few tasks, many more than in a traditional BI tool architecture. They even have a data engine in the browser called Durro that is able to render 10 million records as a graph. Durro is based on WebAssembly (aka WASM).</p><p>Btw, the resulting model had 79.9% accuracy on an eval dataset. I liked how the tool highlighted the importance of various features: Sex, Age, Passenger Class. If you know the Titanic story, you will understand.</p><p>Graphext does the training on the cloud backend and claims to be able to use GPUs and preemptibles. 
They mention Apache Arrow as something they use to minimize the overhead on the backend, but are also hinting that their homegrown Durro engine will gradually replace Arrow.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iOqx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89133b7-0cb8-424a-aa06-4f8dd12e5567_1024x711.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iOqx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89133b7-0cb8-424a-aa06-4f8dd12e5567_1024x711.png 424w, https://substackcdn.com/image/fetch/$s_!iOqx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89133b7-0cb8-424a-aa06-4f8dd12e5567_1024x711.png 848w, https://substackcdn.com/image/fetch/$s_!iOqx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89133b7-0cb8-424a-aa06-4f8dd12e5567_1024x711.png 1272w, https://substackcdn.com/image/fetch/$s_!iOqx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89133b7-0cb8-424a-aa06-4f8dd12e5567_1024x711.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iOqx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89133b7-0cb8-424a-aa06-4f8dd12e5567_1024x711.png" width="1024" height="711" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d89133b7-0cb8-424a-aa06-4f8dd12e5567_1024x711.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:711,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!iOqx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89133b7-0cb8-424a-aa06-4f8dd12e5567_1024x711.png 424w, https://substackcdn.com/image/fetch/$s_!iOqx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89133b7-0cb8-424a-aa06-4f8dd12e5567_1024x711.png 848w, https://substackcdn.com/image/fetch/$s_!iOqx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89133b7-0cb8-424a-aa06-4f8dd12e5567_1024x711.png 1272w, https://substackcdn.com/image/fetch/$s_!iOqx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89133b7-0cb8-424a-aa06-4f8dd12e5567_1024x711.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Last but not least, let&#8217;s take a look at Graphext&#8217;s reporting capabilities. I work a lot with data scientists at Snowflake, and we tag-team on producing reports for our business leaders. We do analyses like root cause, perf and revenue impact of a particular feature launch, drivers of weekly/monthly/quarterly growth or decline. Each activity results in a doc with a narrative and supporting graphs. This is pretty standard for a PM and I&#8217;ve done this at AWS, Google and now Snowflake.</p><p>The unique thing about Snowflake is that the PMs here have a data scientist buddy who actually does the data analysis for them. They look at statistical significance, correlation-vs-causality, A/B tests and sampling. Fudging favorable conclusions becomes a little harder in this setup. My data scientist buddy likes to work in Jupyter and write code in pandas. A PM like me likes to work in Sheets+Docs. 
My data analysis tool is Sheets and I write my narratives in Docs (not Slides).</p><p>Other functions, based on my own experience, tend to avoid Jupyter. Finance likes Sheets/Excel and can read Docs. Engineering Managers live in Jira, but have to use Docs a lot, because ... reasons. Execs prefer either Docs/Word (if they are from AWS) or decks (if they are from Microsoft), and the Xooglers are still undecided. Sales likes Slack, but let&#8217;s set that aside for a moment.<br>So, this leads to Docs as the least common denominator for communicating insights in a high-tech org.</p><p>Graphext recognized that the end goal of every analysis (unless you are just grokking the data) is a report that goes to the product, marketers, execs, finance, etc. They&#8217;ve added the ability to annotate interesting findings, save them as &#8220;insights&#8221; and then produce a report that is basically a collection of slides showing each individual insight. Here is, for example, one of these sliderized insights &#8211; &#8220;Older male passengers from the first class had a pretty decent chance of survival&#8221;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yzRk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b715d7b-5df2-4b41-aac9-a34c235f0922_1024x565.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yzRk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b715d7b-5df2-4b41-aac9-a34c235f0922_1024x565.png 424w, https://substackcdn.com/image/fetch/$s_!yzRk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b715d7b-5df2-4b41-aac9-a34c235f0922_1024x565.png 848w, 
https://substackcdn.com/image/fetch/$s_!yzRk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b715d7b-5df2-4b41-aac9-a34c235f0922_1024x565.png 1272w, https://substackcdn.com/image/fetch/$s_!yzRk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b715d7b-5df2-4b41-aac9-a34c235f0922_1024x565.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yzRk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b715d7b-5df2-4b41-aac9-a34c235f0922_1024x565.png" width="1024" height="565" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b715d7b-5df2-4b41-aac9-a34c235f0922_1024x565.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:565,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!yzRk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b715d7b-5df2-4b41-aac9-a34c235f0922_1024x565.png 424w, https://substackcdn.com/image/fetch/$s_!yzRk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b715d7b-5df2-4b41-aac9-a34c235f0922_1024x565.png 848w, 
https://substackcdn.com/image/fetch/$s_!yzRk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b715d7b-5df2-4b41-aac9-a34c235f0922_1024x565.png 1272w, https://substackcdn.com/image/fetch/$s_!yzRk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b715d7b-5df2-4b41-aac9-a34c235f0922_1024x565.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>While I like it that they thought about the end goal (the insights report), I wish they had an option for me to produce a Doc instead 
of a PDF deck that I could then edit. These reports go through a bunch of editing and commenting, and I prefer to use the text editor for doing all of that. Dear Graphext, just give me the initial export of the insights into a doc, and allow me to embed your beautiful interactive charts (super important &#8211; no static image stuff, please)!</p><p>And this brings us to the end of this review. I liked Graphext and can see myself using it for root cause and business driver analysis. They are different from other BI tools out there, and as a techy I liked that they are experimenting with a new deployment architecture (thick client in WebAssembly in the browser and specialized services on the Cloud backend). The UX seems pretty sleek, albeit a bit buggy. I give the tool extra points for offering text analytics features. Nice job!</p>]]></content:encoded></item><item><title><![CDATA[Showing constituent parse trees in the browser]]></title><description><![CDATA[Update: Added the Stanford NLP link for constituent parse trees in text form]]></description><link>https://www.datancoff.ee/p/showing-constituent-parse-trees-in-the-browser</link><guid isPermaLink="false">https://www.datancoff.ee/p/showing-constituent-parse-trees-in-the-browser</guid><dc:creator><![CDATA[Serhii Sokolenko]]></dc:creator><pubDate>Mon, 21 May 2018 14:47:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d9ed16e7-8688-41ca-8bae-8050d8b7c38a_800x442.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Update: Added the Stanford NLP <a href="http://nlp.stanford.edu:8080/parser/">link</a> for constituent parse trees in text form</p><p>I needed to visualize a sentence parse tree of the &#8220;constituent&#8221; variety (you know, one of those with S, NP, VP and all the goods) and googled &#8220;show constituent trees&#8221;, expecting to see good quality hits on search result page 1. Turns out, there are no good quality web tools out there. 
Sure, there are sites that offer downloads of &#8220;applications that run on any version of Vista&#8221;, but my Vista desktop was temporarily unavailable that day. Google&#8217;s <a href="https://cloud.google.com/natural-language/">Cloud NLP front page</a> offers a nifty tool to parse and display trees, but those are of the &#8220;dependency&#8221; variety.</p><p>I dug a little, and here is what I found. This <a href="http://www.link.cs.cmu.edu/cgi-bin/link/construct-page-4.cgi#submit">link</a> to the Carnegie Mellon page returns the constituency tree in text form (<strong>update</strong>: Stanford NLP has a similar feature at this <a href="http://nlp.stanford.edu:8080/parser/">location</a>).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q9b7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c67b83-524a-4987-8afb-22138f8e3e15_800x442.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q9b7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c67b83-524a-4987-8afb-22138f8e3e15_800x442.png 424w, https://substackcdn.com/image/fetch/$s_!Q9b7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c67b83-524a-4987-8afb-22138f8e3e15_800x442.png 848w, https://substackcdn.com/image/fetch/$s_!Q9b7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c67b83-524a-4987-8afb-22138f8e3e15_800x442.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Q9b7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c67b83-524a-4987-8afb-22138f8e3e15_800x442.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q9b7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c67b83-524a-4987-8afb-22138f8e3e15_800x442.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1c67b83-524a-4987-8afb-22138f8e3e15_800x442.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Q9b7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c67b83-524a-4987-8afb-22138f8e3e15_800x442.png 424w, https://substackcdn.com/image/fetch/$s_!Q9b7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c67b83-524a-4987-8afb-22138f8e3e15_800x442.png 848w, https://substackcdn.com/image/fetch/$s_!Q9b7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c67b83-524a-4987-8afb-22138f8e3e15_800x442.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Q9b7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1c67b83-524a-4987-8afb-22138f8e3e15_800x442.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>Take this text, replace round brackets with the square ones, and plug that text into the <a href="http://www.ironcreek.net/phpsyntaxtree/?">phpSyntaxTree</a> tool, or any of its clones (e.g. this <a href="https://yohasebe.com/rsyntaxtree/">one</a>).</p><p>Voila, here is your tree.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AcdS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25eca13d-c9d4-466a-a87f-fda2b264ade4_800x422.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AcdS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25eca13d-c9d4-466a-a87f-fda2b264ade4_800x422.png 424w, https://substackcdn.com/image/fetch/$s_!AcdS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25eca13d-c9d4-466a-a87f-fda2b264ade4_800x422.png 848w, https://substackcdn.com/image/fetch/$s_!AcdS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25eca13d-c9d4-466a-a87f-fda2b264ade4_800x422.png 1272w, https://substackcdn.com/image/fetch/$s_!AcdS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25eca13d-c9d4-466a-a87f-fda2b264ade4_800x422.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!AcdS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25eca13d-c9d4-466a-a87f-fda2b264ade4_800x422.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25eca13d-c9d4-466a-a87f-fda2b264ade4_800x422.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!AcdS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25eca13d-c9d4-466a-a87f-fda2b264ade4_800x422.png 424w, https://substackcdn.com/image/fetch/$s_!AcdS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25eca13d-c9d4-466a-a87f-fda2b264ade4_800x422.png 848w, https://substackcdn.com/image/fetch/$s_!AcdS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25eca13d-c9d4-466a-a87f-fda2b264ade4_800x422.png 1272w, https://substackcdn.com/image/fetch/$s_!AcdS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25eca13d-c9d4-466a-a87f-fda2b264ade4_800x422.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[Predicting user engagement with news on Reddit using Kaggle or 
Colab]]></title><description><![CDATA[About a month ago I wrote a 3-part blog series (parts 1, 2, and 3) on predicting user engagement with news in Reddit communities (subreddits).]]></description><link>https://www.datancoff.ee/p/predicting-user-engagement-with-news-on-reddit-using-kaggle-or-colab</link><guid isPermaLink="false">https://www.datancoff.ee/p/predicting-user-engagement-with-news-on-reddit-using-kaggle-or-colab</guid><dc:creator><![CDATA[Serhii Sokolenko]]></dc:creator><pubDate>Tue, 08 May 2018 14:55:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/00ae5a8d-e0cd-406f-a683-6c6ee69f20f1_800x197.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>About a month ago I wrote a 3-part blog series (parts <a href="https://cloud.google.com/blog/big-data/2018/03/predicting-community-engagement-on-reddit-using-tensorflow-gdelt-and-cloud-dataflow-part-1">1</a>, <a href="https://cloud.google.com/blog/big-data/2018/03/predicting-community-engagement-on-reddit-using-tensorflow-gdelt-and-cloud-dataflow-part-2">2</a>, and <a href="https://cloud.google.com/blog/big-data/2018/03/predicting-community-engagement-on-reddit-using-tensorflow-gdelt-and-cloud-dataflow-part-3">3</a>) on predicting user engagement with news in Reddit communities (subreddits). The Jupyter notebook that was the basis of that research is now available as a Kaggle <a href="https://www.kaggle.com/datancoffee/predicting-community-engagement-on-reddit">kernel</a>, together with the companion <a href="https://www.kaggle.com/datancoffee/predicting-reddit-community-engagement-dataset">dataset</a>.</p><p>What I found was that the Publishing Domain and the Textual Content of a news article were good predictors of the Subreddit where a news article would end up at, as well as of the person who would share the news post. 
Furthermore, when used together, the Publishing Domain, Textual Content, <strong>and</strong> Submitter yielded a model that predicted the Subreddit with 92% accuracy, something that could be used for anomaly detection, e.g. validating whether a posting to a Subreddit fits previously seen behavior or constitutes a new trend or potentially bot activity.</p><p>For user engagement metrics such as the Popularity Score or the Number of Commenters and Comments, the best predictors were the person who posted the article to Reddit (Post Submitter) and the Subreddit where the news was posted. Models that used the Textual Content of the article did not have the best accuracy for predicting engagement metrics (whereas they were very predictive for Subreddit and Submitter), but I am currently working on improving the text embeddings I used and hope to publish more on this in the near future.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kIYB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36b93885-1199-4b22-a665-bea85c4d9b1d_800x197.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kIYB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36b93885-1199-4b22-a665-bea85c4d9b1d_800x197.png 424w, https://substackcdn.com/image/fetch/$s_!kIYB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36b93885-1199-4b22-a665-bea85c4d9b1d_800x197.png 848w, 
https://substackcdn.com/image/fetch/$s_!kIYB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36b93885-1199-4b22-a665-bea85c4d9b1d_800x197.png 1272w, https://substackcdn.com/image/fetch/$s_!kIYB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36b93885-1199-4b22-a665-bea85c4d9b1d_800x197.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kIYB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36b93885-1199-4b22-a665-bea85c4d9b1d_800x197.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36b93885-1199-4b22-a665-bea85c4d9b1d_800x197.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!kIYB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36b93885-1199-4b22-a665-bea85c4d9b1d_800x197.png 424w, https://substackcdn.com/image/fetch/$s_!kIYB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36b93885-1199-4b22-a665-bea85c4d9b1d_800x197.png 848w, 
https://substackcdn.com/image/fetch/$s_!kIYB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36b93885-1199-4b22-a665-bea85c4d9b1d_800x197.png 1272w, https://substackcdn.com/image/fetch/$s_!kIYB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36b93885-1199-4b22-a665-bea85c4d9b1d_800x197.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">Predictors for User Engagement with News in&nbsp;Reddit</figcaption></figure></div><p>With the notebook now checked into <a href="https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis/tree/master/models">GitHub</a> and shared on <a href="https://www.kaggle.com/datancoffee/predicting-community-engagement-on-reddit">Kaggle</a>, you can reproduce the results of the research much more easily than before. Here are some tips and tricks for running the notebook in these environments.</p><h4>Optimizing the notebook execution for the hosting environment</h4><p>You can run this notebook either in <a href="https://colab.research.google.com">Colab</a> or in <a href="https://www.kaggle.com/datancoffee/predicting-community-engagement-on-reddit">Kaggle</a>. 
To run it in Colab, download the RedditEngagement notebook from <a href="https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis/tree/master/models">GitHub</a>, import it into <a href="https://colab.research.google.com">Colab</a>, and set the <code>current_run_in_colab</code> variable to <code>True</code> in the &#8220;Define Constants and Global Variables&#8221; code cell; otherwise, set it to <code>False</code>.</p><h4>You can choose between BigQuery and CSV files for the input&nbsp;dataset</h4><p>Decide if you want to get the training data from the <a href="https://bigquery.cloud.google.com/dataset/datancoffee:discussion_opinions?pli=1">datancoffee BigQuery dataset</a> or from <a href="https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis/blob/master/models/data/reddit-ds.zip">snapshot CSV files</a>. At present, only Colab lets you access the datancoffee BigQuery dataset. To get training data from BigQuery, set the <code>current_read_from_bq</code> variable to <code>True</code> in the &#8220;Define Constants and Global Variables&#8221; code cell; otherwise, set it to <code>False</code>.</p><h4>Getting CSV&nbsp;files</h4><p>Download the <a href="https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis/blob/master/models/data/reddit-ds.zip">reddit-ds.zip</a> snapshot file archive from the <a href="https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis/">Dataflow Opinion Analysis GitHub repository</a>. <a href="https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis/blob/master/models/RedditEngagement.ipynb">Instructions</a> for setting up the dataset are available in the notebook itself. In Kaggle, the snapshot dataset is packaged with the kernel. You don&#8217;t have to do anything to use it.</p><h4>Run in Kaggle with&nbsp;GPUs</h4><p>Running with GPUs makes a real difference in execution time: training runs take ~20s with GPUs vs. 
~400s with regular CPUs.</p><h4>Improving model&nbsp;accuracy</h4><p>The number of data points in the full Reddit dataset is large, so the size of available memory is important. The <code>current_sample_frac</code> variable controls the fraction of the input dataset that will be sampled and then divided into training, test, and validation subsets. The default settings in the notebook have been selected to run in the publicly hosted versions of Kaggle and Colab. For the SubredditClassification goal the setting is <code>current_sample_frac = 0.5</code>, and for the CommentsClassification goal the setting is <code>current_sample_frac = 0.25</code>.</p><p>Note that this is about <strong>half of what we used in our 3-part blog series</strong>, so the accuracy numbers you will get when running the notebook unmodified will be lower than what we have in our blog series.</p><p>Kaggle gives 6.5GB of memory when running with GPUs. To get higher accuracy, self-host the model in an environment with ~35&#8211;40GB of available memory. 
In this case you can set <code>current_sample_frac = 0.99</code> for SubredditClassification and <code>current_sample_frac = 0.5</code> (or higher) for CommentsClassification.</p><p>Enjoy and happy deep learning!</p>]]></content:encoded></item><item><title><![CDATA[How to programmatically monitor your Cloud Dataflow jobs]]></title><description><![CDATA[Ever wanted to define alerts and monitor the status of your Cloud Dataflow jobs programmatically instead of checking some UI every 30 minutes?]]></description><link>https://www.datancoff.ee/p/how-to-programmatically-monitor-your-cloud-dataflow-jobs</link><guid isPermaLink="false">https://www.datancoff.ee/p/how-to-programmatically-monitor-your-cloud-dataflow-jobs</guid><dc:creator><![CDATA[Serhii Sokolenko]]></dc:creator><pubDate>Fri, 23 Mar 2018 11:50:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dhJG!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb008d0-9f51-45d9-a807-90edecf46ffd_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ever wanted to define alerts and monitor the status of your Cloud Dataflow jobs programmatically instead of checking some UI every 30 minutes?</p><p>You can do that with Stackdriver Monitoring and the metrics that Cloud Dataflow exports to Stackdriver. Here is a handy list of metrics available for Dataflow:</p><p><strong><a href="https://cloud.google.com/monitoring/api/metrics_gcp#gcp-dataflow">GCP Metrics List | Stackdriver Monitoring | Google Cloud</a></strong><a href="https://cloud.google.com/monitoring/api/metrics_gcp#gcp-dataflow"><br></a><em><a href="https://cloud.google.com/monitoring/api/metrics_gcp#gcp-dataflow">This page describes the predefined metrics available from Google Cloud Platform (GCP) services. 
These metrics are&#8230;</a></em><a href="https://cloud.google.com/monitoring/api/metrics_gcp#gcp-dataflow">cloud.google.com</a></p><p>My favorites:</p><p><code>job/is_failed</code>: This should be self-explanatory. It makes it very easy to set up alerts on failed jobs.</p><p><code>job/system_lag</code>, <code>job/data_watermark_age</code>: If you have a streaming pipeline, these two metrics will indicate if your pipeline is beginning to lag.</p><p><code>job/elapsed_time</code>: Another easy-to-understand metric. This is how long your pipeline has taken so far. If you want to alert on pipelines exceeding predefined duration thresholds, here is your chance.</p><p>And if you still prefer a graphical UI for monitoring your metrics, check out this blog <a href="https://cloud.google.com/blog/big-data/2017/03/monitoring-and-improving-your-google-cloud-dataflow-pipelines-with-google-stackdriver">post</a>.</p>]]></content:encoded></item><item><title><![CDATA[Calculating per-job Cloud Dataflow costs - now possible with job labels]]></title><description><![CDATA[Ever wanted to track your resource usage and costs by specific Cloud Dataflow jobs? Cloud Dataflow recently started labeling billing records with job IDs. Here is how to calculate your job costs.]]></description><link>https://www.datancoff.ee/p/calculating-per-job-cloud-dataflow-costs-now-possible-with-job-labels</link><guid isPermaLink="false">https://www.datancoff.ee/p/calculating-per-job-cloud-dataflow-costs-now-possible-with-job-labels</guid><dc:creator><![CDATA[Serhii Sokolenko]]></dc:creator><pubDate>Thu, 01 Mar 2018 15:07:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b51a7b71-a5d6-4a50-aa07-ee5ac30e8ce4_800x678.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ever wanted to track your resource usage and costs by specific <a href="https://cloud.google.com/dataflow">Cloud Dataflow jobs</a>? 
Cloud Dataflow recently started labeling billing records with job IDs. Here is how to calculate your job costs.</p><p>Enable billing export into BigQuery.</p><p>Go to Billing &gt; Billing export in the <a href="https://console.cloud.google.com/billing/">Google Cloud Console</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ejGA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47b264f9-b1f6-48a6-b5b1-541fab7d63f1_800x678.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ejGA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47b264f9-b1f6-48a6-b5b1-541fab7d63f1_800x678.png 424w, https://substackcdn.com/image/fetch/$s_!ejGA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47b264f9-b1f6-48a6-b5b1-541fab7d63f1_800x678.png 848w, https://substackcdn.com/image/fetch/$s_!ejGA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47b264f9-b1f6-48a6-b5b1-541fab7d63f1_800x678.png 1272w, https://substackcdn.com/image/fetch/$s_!ejGA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47b264f9-b1f6-48a6-b5b1-541fab7d63f1_800x678.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ejGA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47b264f9-b1f6-48a6-b5b1-541fab7d63f1_800x678.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/47b264f9-b1f6-48a6-b5b1-541fab7d63f1_800x678.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ejGA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47b264f9-b1f6-48a6-b5b1-541fab7d63f1_800x678.png 424w, https://substackcdn.com/image/fetch/$s_!ejGA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47b264f9-b1f6-48a6-b5b1-541fab7d63f1_800x678.png 848w, https://substackcdn.com/image/fetch/$s_!ejGA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47b264f9-b1f6-48a6-b5b1-541fab7d63f1_800x678.png 1272w, https://substackcdn.com/image/fetch/$s_!ejGA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47b264f9-b1f6-48a6-b5b1-541fab7d63f1_800x678.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>Create a new BigQuery dataset, if needed, to store your billing data.</p><p>Give the billing export a little time to populate your dataset.</p><p>Then, in <a href="https://bigquery.cloud.google.com">BigQuery</a>, run this query:</p><pre><code>#standardSQL</code></pre><pre><code>SELECT l.value AS JobID,  ROUND(SUM(cost),3) AS JobCost 
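-- UNNEST expands the repeated labels field into one row per label
FROM `[PROJECT].[DATASET].gcp_billing_export_v1_[account_id]` bill, UNNEST(bill.labels) l
-- Keep only Dataflow charges that carry the job ID label
WHERE service.description = 'Cloud Dataflow' AND l.key = 'goog-dataflow-job-id'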
GROUP BY 1</code></pre><p>You can get more than $$s from Cloud Dataflow billing exports. Each consumed SKU unit is labeled and you can calculate how many CPU hours, Memory GB hours, and other units you consumed per job using pretty much the same approach.</p><p>And the costs of my Dataflow jobs? They were in line with what I expected, but it was good to get reassured by real usage data.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vMF3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c316562-a55b-4194-8cc3-164dde0f558d_800x497.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vMF3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c316562-a55b-4194-8cc3-164dde0f558d_800x497.png 424w, https://substackcdn.com/image/fetch/$s_!vMF3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c316562-a55b-4194-8cc3-164dde0f558d_800x497.png 848w, https://substackcdn.com/image/fetch/$s_!vMF3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c316562-a55b-4194-8cc3-164dde0f558d_800x497.png 1272w, https://substackcdn.com/image/fetch/$s_!vMF3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c316562-a55b-4194-8cc3-164dde0f558d_800x497.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vMF3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c316562-a55b-4194-8cc3-164dde0f558d_800x497.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3c316562-a55b-4194-8cc3-164dde0f558d_800x497.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!vMF3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c316562-a55b-4194-8cc3-164dde0f558d_800x497.png 424w, https://substackcdn.com/image/fetch/$s_!vMF3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c316562-a55b-4194-8cc3-164dde0f558d_800x497.png 848w, https://substackcdn.com/image/fetch/$s_!vMF3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c316562-a55b-4194-8cc3-164dde0f558d_800x497.png 1272w, https://substackcdn.com/image/fetch/$s_!vMF3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c316562-a55b-4194-8cc3-164dde0f558d_800x497.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[Building dictionaries for Word Encodings using BigQuery SQL]]></title><description><![CDATA[I am working on the Reddit Community Engagement analysis, and one of my data sources is the GDELT BigQuery dataset. 
I love the richness of themes available in that data, but sometimes it can be too &#8230; chatty.]]></description><link>https://www.datancoff.ee/p/building-dictionaries-for-word-encodings-using-bigquery-sql</link><guid isPermaLink="false">https://www.datancoff.ee/p/building-dictionaries-for-word-encodings-using-bigquery-sql</guid><dc:creator><![CDATA[Serhii Sokolenko]]></dc:creator><pubDate>Wed, 07 Feb 2018 15:35:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dhJG!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb008d0-9f51-45d9-a807-90edecf46ffd_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I am working on the Reddit Community Engagement analysis, and one of my data sources is the GDELT BigQuery <a href="https://bigquery.cloud.google.com/dataset/gdelt-bq:gdeltv2">dataset</a>. I love the richness of themes available in that data, but sometimes it can be too&nbsp;&#8230; chatty.</p><p>Here is an example of how an LA Times <a href="http://www.latimes.com/local/lanow/la-me-ln-ghost-ship-fire-charges-20170605-story.html">article</a> about a warehouse fire is tagged in GDELT across 3 fields V2Themes, AllNames, and V2Locations of the GDELT global knowledge graph table <a href="https://bigquery.cloud.google.com/table/gdelt-bq:gdeltv2.gkg">gdelt-bq.gdeltv2.gkg</a>.</p><pre><code>MANMADE_DISASTER_POWER_OUTAGES Alameda_County__California__United_States WB_1174_WAREHOUSING_AND_STORAGE TAX_FNCACT_FIRE_CHIEF DISASTER_FIRE Oakland__California__United_States CRISISLEX_C07_SAFETY CRISISLEX_T03_DEAD TRIAL Alameda_County TAX_FNCACT_CHIEF EPU_POLICY_AUTHORITIES TAX_FNCACT_VICTIMS WB_793_TRANSPORT_AND_LOGISTICS_SERVICES TAX_FNCACT_CREATIVE_DIRECTOR WB_135_TRANSPORT TAX_FNCACT_ARTISTS TAX_FNCACT_AUTHORITIES MANMADE_DISASTER_IMPLIED Max_Harris TAX_FNCACT_CHILDREN SECURITY_SERVICES TRANSPARENCY CRISISLEX_T02_INJURED TAX_FNCACT_DOORMAN GOV_LOCALGOV 
TAX_WORLDLANGUAGES_SERRA EPU_CATS_MIGRATION_FEAR_FEAR TAX_FNCACT_INSPECTOR TAX_FNCACT_ATTORNEYS TAX_FNCACT_OFFICIALS WB_2445_NON_STATE_SECURITY_ACTORS Ghost_Ship CRISISLEX_CRISISLEXREC KILL TAX_FNCACT_POLICE WB_2451_REBELS_GUERRILLAS_AND_INSURGENTS Jeffrey_Krasnoff TAX_FNCACT_CORONER WB_2433_CONFLICT_AND_VIOLENCE CRISISLEX_T01_CAUTION_ADVICE SHORTAGE WB_2432_FRAGILITY_CONFLICT_AND_VIOLENCE ARREST TAX_FNCACT_DIRECTOR California__United_States POWER_OUTAGE SOC_GENERALCRIME CRISISLEX_T08_MISSINGFOUNDTRAPPEDPEOPLE TAX_FNCACT_PIRATE</code></pre><p>I want to put these tags to good use in TensorFlow models, which means embedding them into multi-dimensional vectors, but their sheer number causes all kinds of memory, data size, and model accuracy issues. So, to have better control over which tags end up in my sequences, I wrote a SQL snippet that does the same thing that the Keras <a href="https://keras.io/preprocessing/text/#tokenizer">Tokenizer</a> or the TensorFlow <a href="http://tflearn.org/data_utils/">VocabularyProcessor</a> does. Having it available right where my source data lives lets me filter and order as I like (review the final query in this <a href="https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis/blob/master/models/RedditEngagement.ipynb">repo</a>). I can even create Bag-of-Words or Sequence encodings, depending on my use case.</p><pre><code>WITH
   b AS (
    SELECT gkg.DocumentIdentifier, gkg.V2Themes, gkg.AllNames, gkg.V2Locations
    FROM `gdelt-bq.gdeltv2.gkg` gkg 
    LIMIT 1000
  )
  ,c AS ( -- Mentions of Themes
    SELECT b.DocumentIdentifier, SPLIT(theme_mentions,',')[SAFE_OFFSET(0)] AS Entity, SPLIT(theme_mentions,',')[SAFE_OFFSET(1)] AS Offset
    FROM b, UNNEST(SPLIT(b.V2Themes,";")) AS theme_mentions
  )
  ,d AS (
    SELECT b.DocumentIdentifier, 
    REPLACE(SPLIT(name_mentions,',')[SAFE_OFFSET(0)],' ','_') AS Name, 
    SPLIT(name_mentions,',')[SAFE_OFFSET(1)] AS Offset
    FROM b, UNNEST(SPLIT(b.AllNames,";")) AS name_mentions
  )
  ,e AS ( -- Calculate frequency stats for Name mentions
    SELECT Name, COUNT(DISTINCT DocumentIdentifier) FROM d 
    GROUP BY 1 HAVING COUNT(DISTINCT DocumentIdentifier) &gt;= 10
  )
  ,f AS (-- Filter mentions of Names
    SELECT d.DocumentIdentifier, d.Name AS Entity, d.Offset
    FROM d INNER JOIN e ON e.Name = d.Name
  )
  ,g AS ( -- Mentions of Locations
    SELECT b.DocumentIdentifier, SPLIT(loc_mentions,'#') AS LocFieldArray 
    FROM b, UNNEST(SPLIT(b.V2Locations,";")) AS loc_mentions
  )
  ,h AS (
    SELECT 
      g.DocumentIdentifier, 
      REPLACE(REPLACE(LocFieldArray[SAFE_OFFSET(1)],' ','_'),',','_') AS Loc, 
      LocFieldArray[SAFE_OFFSET(8)] AS Offset
    FROM g
  )
  ,i AS ( -- Calculate frequency stats for Location mentions
    SELECT Loc, COUNT(DISTINCT DocumentIdentifier) FROM h 
    GROUP BY 1 HAVING COUNT(DISTINCT DocumentIdentifier) &gt;= 10
  )
  ,j AS ( -- Filter mentions of Locations
    SELECT h.DocumentIdentifier, h.Loc AS Entity, h.Offset
    FROM h INNER JOIN i ON i.Loc = h.Loc
  )
  ,k AS ( -- Join all Themes, Locations, Names
    SELECT DocumentIdentifier, Entity, Offset FROM c
    UNION ALL
    SELECT DocumentIdentifier, Entity, Offset FROM f
    UNION ALL
    SELECT DocumentIdentifier, Entity, Offset FROM j
  ) 
  ,n AS ( -- Create Encoding for Entities
    SELECT Entity, cnt, CAST(RANK() OVER (ORDER BY cnt DESC, Entity ASC) AS STRING) AS EntityIdx 
    FROM (SELECT Entity, COUNT(*) AS cnt FROM k GROUP BY 1) 
  )
SELECT DocumentIdentifier, 
  STRING_AGG(DISTINCT EntityIdx," ") AS BOWEntitiesEncoded, -- For Bag-of-Words encoding order is not important
  COUNT(DISTINCT k.Entity) AS BOWEncodingLength,
  STRING_AGG(DISTINCT k.Entity," ") AS EntitiesBOW, 
  STRING_AGG(EntityIdx," " ORDER BY SAFE_CAST(Offset AS INT64) ASC) AS EntitiesSeqEncoded, -- For CNN and RNN analysis, use Entity Sequence (Offset is a STRING, so cast it to sort numerically)
  COUNT(*) AS EntitiesSeqLength,
  STRING_AGG(k.Entity," " ORDER BY SAFE_CAST(Offset AS INT64) ASC) AS EntitiesSeq
FROM k
  INNER JOIN n ON k.Entity = n.Entity
WHERE k.Entity&lt;&gt;""
GROUP BY 1</code></pre><p>The result? Instead of the longish field of tags, I get a sequence of integers that works just as well for embedding purposes as the tags themselves.</p><pre><code>72 153 693 27 68 443 5050 646 42 5333 345 447 15 1350 75 6509 6901 237 49 2073 1935 224 81 38 66 5 10 4 8819 55 14 451 150 1162 27041 942 252 16 409 669 5172 149 28 468 79 128 3478 3980 10288 70</code></pre>]]></content:encoded></item><item><title><![CDATA[Predicting social engagement for the world’s news with TensorFlow and Cloud Dataflow: Part 1]]></title><description><![CDATA[What happens if you take a huge cross-section of the world&#8217;s news (The GDELT Project), mix it with the biggest online discussion website, and try to predict what drives the online conversation about the news on Reddit?]]></description><link>https://www.datancoff.ee/p/predicting-social-engagement-for-the-worlds-news-with-tensorflow-and-cloud-dataflow-part-1</link><guid isPermaLink="false">https://www.datancoff.ee/p/predicting-social-engagement-for-the-worlds-news-with-tensorflow-and-cloud-dataflow-part-1</guid><dc:creator><![CDATA[Serhii Sokolenko]]></dc:creator><pubDate>Fri, 15 Dec 2017 18:50:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/89e3e63c-0c47-4410-b9a6-6b3eb1c8fcfc_484x495.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>What happens if you take a huge cross-section of the world&#8217;s news (<a href="https://www.gdeltproject.org/">The GDELT Project</a>), mix it with the biggest <a href="http://www.reddit.com">online discussion</a> website, and try to predict what drives the online conversation about the news on Reddit? Is it the author of the news article, the publication time, the title, the contents of the post? 
Could it be the submitter on the discussion board or the sub-forum where the news was posted to?</p><p>One natural hypothesis to make is that the contents of a news article will influence the subreddit that this news post will end up being posted in. That&#8217;s a logical thing to assume, but is there proof, and if so, can we get a list of topics and correlate them with subreddits? In addition to the subreddit, can we deduce the factors that will drive the highest number of participants on the discussion thread, and the highest number of comments?</p><p>What&#8217;s the use of these insights? Imagine a news publisher or a marketing person who wants to see more engagement with the content that they are creating. Knowing what engages your audience can drive up your page views and ultimately revenue. And if you are a researcher who is tracking how news is being disseminated across various audiences, having this understanding will help you too.</p><p>A prediction task such as this one is a prime candidate for Machine Learning, and <a href="https://www.tensorflow.org/">TensorFlow</a> is the authoritative framework for training ML models and running predictions. In a sense, GDELT&#8217;s catalog of news coverage and Reddit discussions are a match made in heaven, because GDELT has the features, the inputs, for training models and Reddit has the labels, or the outputs of predictions. GDELT has articles indexed by the article URL and attributes such as Title, Author, Publication Date, the full content of the article etc. Reddit also has the article URL, but in addition to that brings the submitter of the article to the board, the subreddit (topic area), and all of the comments that the users of Reddit have created based on the submission.</p><p>As we went about building the TensorFlow model for this prediction task, we had to first assemble the dataset from which TensorFlow would be able to learn. 
In this multi-part blog series we will first explain how we built this input dataset, bringing GDELT and Reddit data together, and then we will explore training and predicting of outcomes.</p><p>Our first task was to place the GDELT and Reddit data into a database that was easy to query. BigQuery was the natural choice here. There are <a href="https://bigquery.cloud.google.com/table/gdelt-bq:gdeltv2.gkg_partitioned?tab=schema">GDELT</a> and <a href="https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_posts">Reddit</a> BigQuery datasets already, but we wanted to do a deeper sentiment analysis on the raw content of news articles and Reddit posts and comments, and for that we used the Dataflow Opinion Analysis project (<a href="https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis">github repo</a>) we wrote about earlier.</p><p>Back in June 2017 Reza Rokni and John LaBarge&#8202;&#8212;&#8202;two GCP solutions architects&#8202;&#8212;&#8202;shared their best practices for developing production-quality Dataflow pipelines in a two-part blog series (<a href="https://cloud.google.com/blog/big-data/2017/06/guide-to-common-cloud-dataflow-use-case-patterns-part-1">blog 1</a>, and <a href="https://cloud.google.com/blog/big-data/2017/08/guide-to-common-cloud-dataflow-use-case-patterns-part-2">blog 2</a>). As we were solidifying the design for our data preparation pipeline, we used several of these best practices to future-proof our opinion analysis infrastructure. 
The rest of the article will explain in more detail how we applied these Dataflow design practices to the data problems we encountered in building our training dataset.</p><p>The top 10 design patterns for Dataflow can be roughly divided according to the regular lifecycle of a Dataflow pipeline:</p><ul><li><p>Orchestrating the execution of a pipeline</p></li><li><p>Onboarding external data</p></li><li><p>Joining data</p></li><li><p>Analyzing data</p></li><li><p>Writing data</p></li></ul><p>We used patterns in almost every one of the above categories. For onboarding external data, we used an external service access <a href="https://cloud.google.com/blog/big-data/2017/06/guide-to-common-cloud-dataflow-use-case-patterns-part-1#pattern-calling-external-services-for-data-enrichment">pattern</a> that allowed us to call Cloud NLP to enrich our dataset with additional subject information. For joining the Reddit posts and comments, we used the CoGroupByKey <a href="https://cloud.google.com/blog/big-data/2017/08/guide-to-common-cloud-dataflow-use-case-patterns-part-2#pattern-joining-two-pcollections-on-a-common-key">pattern</a>. When performing data analysis, we relied on the GroupByKey <a href="https://cloud.google.com/blog/big-data/2017/08/guide-to-common-cloud-dataflow-use-case-patterns-part-2#pattern-groupby-using-multiple-data-properties">pattern</a> for key classes with multiple properties. And, lastly, when dealing with invalid or malformed input data, we implemented a Bigtable sink that collected invalid input records according to the Dead Letter queue <a href="https://cloud.google.com/blog/big-data/2017/06/guide-to-common-cloud-dataflow-use-case-patterns-part-1#pattern-dealing-with-bad-data">pattern</a>.</p><p>Let&#8217;s dive deeper into how these patterns helped us assemble our training dataset.</p><p>As we mentioned before, we sourced half of our training set (the features) from GDELT using files in gzipped JSON format. 
The other half of our training set (the labels) consists of Reddit posts and their comments, available in a BigQuery dataset in two tables: posts and comments. We wanted to run both the GDELT news articles and the Reddit posts/comments through a very similar processing pipeline, extracting subjects and sentiments from both of them, and storing the results in BigQuery. We therefore wrote a Dataflow pipeline that can change its input source depending on a parameter we pass to the job, but does all the downstream processing in the same way. Ultimately, we ended up with two BigQuery datasets that had the same schema, and that we could join with each other via the URL of the news article post using BigQuery&#8217;s SQL.</p><p>Here is the high-level design of the Dataflow pipeline that we used to bring both the GDELT and Reddit data over to BigQuery.</p><ul><li><p>Read from inputs</p></li><li><p>Filter already processed URLs</p></li><li><p>Filter exact duplicates</p></li><li><p>Index (extract opinions and tags)</p></li><li><p>Filter &#8220;approximate&#8221; duplicates</p></li><li><p>Enrich with Cloud NLP</p></li><li><p>Write opinions to BigQuery and invalid inputs to Bigtable</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Eq-H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf211c1a-9597-460c-bb1c-68d1d07c852c_484x495.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Eq-H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf211c1a-9597-460c-bb1c-68d1d07c852c_484x495.png 424w, 
https://substackcdn.com/image/fetch/$s_!Eq-H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf211c1a-9597-460c-bb1c-68d1d07c852c_484x495.png 848w, https://substackcdn.com/image/fetch/$s_!Eq-H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf211c1a-9597-460c-bb1c-68d1d07c852c_484x495.png 1272w, https://substackcdn.com/image/fetch/$s_!Eq-H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf211c1a-9597-460c-bb1c-68d1d07c852c_484x495.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Eq-H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf211c1a-9597-460c-bb1c-68d1d07c852c_484x495.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf211c1a-9597-460c-bb1c-68d1d07c852c_484x495.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Eq-H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf211c1a-9597-460c-bb1c-68d1d07c852c_484x495.png 424w, 
https://substackcdn.com/image/fetch/$s_!Eq-H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf211c1a-9597-460c-bb1c-68d1d07c852c_484x495.png 848w, https://substackcdn.com/image/fetch/$s_!Eq-H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf211c1a-9597-460c-bb1c-68d1d07c852c_484x495.png 1272w, https://substackcdn.com/image/fetch/$s_!Eq-H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf211c1a-9597-460c-bb1c-68d1d07c852c_484x495.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>As we were implementing the input reader for Reddit data, we had to join the posts and comments table together. This join implementation follows the recommendations of our CoGroupByKey design <a href="https://cloud.google.com/blog/big-data/2017/08/guide-to-common-cloud-dataflow-use-case-patterns-part-2#pattern-joining-two-pcollections-on-a-common-key">pattern</a>. For each dataset in the join we created a KV key-value pair and then applied the CoGroupByKey operation to match the keys between the two datasets. 
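</p><p>The keyed join just described can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the pipeline's code: it does not use the Beam SDK, an in-memory map stands in for what CoGroupByKey does across workers, and the names CoGroupSketch, Grouped, groupFor, coGroupByKey and merge are made up for this example.</p><pre><code>import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, in-memory stand-in for a CoGroupByKey join of posts and comments. */
public class CoGroupSketch {

  /** Everything that shares one key, similar in spirit to a co-grouped result. */
  static final class Grouped {
    final List&lt;String&gt; posts = new ArrayList&lt;&gt;();
    final List&lt;String&gt; comments = new ArrayList&lt;&gt;();
  }

  /** Looks up the group for a key, creating it on first use. */
  private static Grouped groupFor(Map&lt;String, Grouped&gt; groups, String key) {
    Grouped group = groups.get(key);
    if (group == null) {
      group = new Grouped();
      groups.put(key, group);
    }
    return group;
  }

  /** Each input element is a {key, value} pair; values with equal keys land in one group. */
  static Map&lt;String, Grouped&gt; coGroupByKey(List&lt;String[]&gt; posts, List&lt;String[]&gt; comments) {
    Map&lt;String, Grouped&gt; groups = new TreeMap&lt;&gt;();
    for (String[] kv : posts) {
      groupFor(groups, kv[0]).posts.add(kv[1]);
    }
    for (String[] kv : comments) {
      groupFor(groups, kv[0]).comments.add(kv[1]);
    }
    return groups;
  }

  /** Iterates over the joined groups and builds one summary entity per post. */
  static List&lt;String&gt; merge(Map&lt;String, Grouped&gt; groups) {
    List&lt;String&gt; merged = new ArrayList&lt;&gt;();
    for (Map.Entry&lt;String, Grouped&gt; entry : groups.entrySet()) {
      Grouped group = entry.getValue();
      for (String post : group.posts) {
        merged.add(entry.getKey() + ": " + post + " (" + group.comments.size() + " comments)");
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    List&lt;String[]&gt; posts = new ArrayList&lt;&gt;();
    posts.add(new String[] {"post1", "Warehouse fire charges"});
    List&lt;String[]&gt; comments = new ArrayList&lt;&gt;();
    comments.add(new String[] {"post1", "first comment"});
    comments.add(new String[] {"post1", "second comment"});
    System.out.println(merge(coGroupByKey(posts, comments)));
  }
}</code></pre><p>Calling merge(coGroupByKey(posts, comments)) on two lists of {key, value} pairs yields one summary string per post, with the comments that share its key counted alongside it.</p><p>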
After the join we iterated over the results to create a single entity that had both post and comment information.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Kt8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0cac6f-e165-4c8e-a76b-14db91b9949d_800x760.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Kt8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0cac6f-e165-4c8e-a76b-14db91b9949d_800x760.png 424w, https://substackcdn.com/image/fetch/$s_!4Kt8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0cac6f-e165-4c8e-a76b-14db91b9949d_800x760.png 848w, https://substackcdn.com/image/fetch/$s_!4Kt8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0cac6f-e165-4c8e-a76b-14db91b9949d_800x760.png 1272w, https://substackcdn.com/image/fetch/$s_!4Kt8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0cac6f-e165-4c8e-a76b-14db91b9949d_800x760.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Kt8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0cac6f-e165-4c8e-a76b-14db91b9949d_800x760.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a0cac6f-e165-4c8e-a76b-14db91b9949d_800x760.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!4Kt8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0cac6f-e165-4c8e-a76b-14db91b9949d_800x760.png 424w, https://substackcdn.com/image/fetch/$s_!4Kt8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0cac6f-e165-4c8e-a76b-14db91b9949d_800x760.png 848w, https://substackcdn.com/image/fetch/$s_!4Kt8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0cac6f-e165-4c8e-a76b-14db91b9949d_800x760.png 1272w, https://substackcdn.com/image/fetch/$s_!4Kt8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0cac6f-e165-4c8e-a76b-14db91b9949d_800x760.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Pro Tips:</p><ul><li><p>Side Inputs could be a faster alternative to CoGroupByKey when joining datasets if one of the datasets can fit into the memory of a Dataflow worker VM.</p></li><li><p>Use the service-based <a href="https://cloud.google.com/dataflow/service/dataflow-service-desc#cloud-dataflow-shuffle">Dataflow Shuffle</a> for better performance of your 
CoGroupByKey. Add the following parameter to your pipeline: --experiments=shuffle_mode=service</p></li></ul><p>After reading our inputs (either news articles from GDELT or posts from Reddit), we proceeded to eliminate duplicates from our data. Duplicate elimination is typically done either by looking up a record ID among a list of already processed IDs, or, if the record wasn&#8217;t yet processed, by grouping elements by some key that represents uniqueness.</p><p>There are three levels of filtering in our pipeline that use both of these techniques. In the first filter we simply check if the URL of the record (in effect, the ID column in our dataset) was already processed and stored in BigQuery. For this check we use a side input of hashes of the URLs and populate it from the BigQuery dataset we are writing to.</p><p>In the second and third filters, which we apply only to news articles in GDELT, we are looking at the contents of the news article. We don&#8217;t need to apply this filtering to Reddit records because our data source guarantees uniqueness of posts. However, the GDELT data source is highly redundant, reflecting the fact that the world of news is highly repetitive as well.</p><p>Original news stories are still being written by staff at major news publications or by bloggers, but a large percentage of news websites, especially regional publications, repost news from news wires such as Associated Press, Reuters, Agence France Presse, Deutsche Welle etc. 
Our pipeline makes an effort to identify such reposts and group them together.</p><p>A note on terminology we will use in the remainder of this blog:</p><ul><li><p>Original post: A web page where the original news story was published for the first time</p></li><li><p>Repost: A web page which re-posted/republished an original story</p></li><li><p>Publisher: the domain where a story (original or reposted) was published</p></li></ul><p>When grouping raw news articles, we first try to match articles that are identical in content, without any variation in text. In our pipeline this is represented by Filter #2, and this filtering is implemented by grouping all incoming text articles by the hash we calculate over the text of the article concatenated with the Publication Date.</p><p>After this initial, simplistic grouping, we then try to group articles that have small variations in text, for example, because they prefix the source of the publication (e.g. &#8220;CHICAGO (AP)&#8221;), or because they add a copyright statement at the end. This is a more advanced grouping than just taking a hash over the text. Here, we need to group on combinations of attributes of the article.</p><p>The grouping <a href="https://cloud.google.com/blog/big-data/2017/08/guide-to-common-cloud-dataflow-use-case-patterns-part-2#pattern-groupby-using-multiple-data-properties">pattern</a> recommends defining a class that has all of the attributes by which you want to group, and then tagging this class with AvroCoder as the @DefaultCoder of the class. By doing that we get the serialization/deserialization of objects of this class for free, and we can then use GroupByKey to group by it.</p><pre><code>/**
 * @param indexes Document indexes
 * @return a POJO containing 2 PCollections: Unique docs, and Duplicates
 */
private static ContentDuplicateOrNot filterSoftDuplicates(
    PCollection&lt;ContentIndexSummary&gt; indexes) {

  // Key each document, group by that key, and split each group into two tagged outputs.
  PCollectionTuple dedupeOrNot = indexes
      .apply("Extract Text grouping key",
          ParDo.of(new GetContentIndexSummaryKeyFn()))
      .apply("Group by Text grouping key",
          GroupByKey.&lt;ContentSoftDeduplicationKey, ContentIndexSummary&gt;create())
      .apply("Eliminate Text dupes",
          ParDo.of(new EliminateTextDupes())
              .withOutputTags(PipelineTags.indexedContentNotToDedupeTag,
                  TupleTagList.of(PipelineTags.indexedContentToDedupeTag)));

  // Convert the documents tagged as duplicates into TableRows.
  PCollection&lt;TableRow&gt; dedupedWebresources =
      dedupeOrNot.get(PipelineTags.indexedContentToDedupeTag)
          .apply(ParDo.of(new CreateWebresourceTableRowFromDupeIndexSummaryFn()));

  ContentDuplicateOrNot contentDuplicateOrNot = new ContentDuplicateOrNot(
      dedupeOrNot.get(PipelineTags.indexedContentNotToDedupeTag),
      dedupedWebresources);

  return contentDuplicateOrNot;
}</code></pre><p>The document attributes that have proven effective in finding similar content include the title of the document, the document length rounded to a thousand characters, and the topics that are discussed in the article. The title and document length are something we get without any additional processing because they are part of the source dataset. To get the topics of the document we need to run NLP and sentiment analysis, and this is why this third and last filtering step is slotted behind the &#8220;Indexing&#8221; step. The content-based grouping is highly accurate and allows us to reduce the number of stories we insert into our analysis dataset by about 45%. 
For example, in June 2017 there were 6.1M web pages from which the GDELT project extracted English-language news, but only 3.7M unique stories. The additional benefit from this deduplication is that we are able to aggregate the social impact of all the reposts of an original story.</p><p>As we processed text documents in the Indexing step, we sometimes encountered cases where the inputs were malformed and could not be processed. For example, in GDELT we sometimes saw text full of CSS formatting, the result of edge cases in the GDELT web crawler. Or, and this is typical for social content on Reddit, we saw posts that consisted entirely of emoticons or <a href="https://en.wikipedia.org/wiki/Unicode_symbols">Unicode symbol</a> characters that could not be processed by NLP libraries. This is another opportunity to reach into our bag of patterns. Normally, a developer would place a potentially &#8220;dangerous&#8221; block of code into a try/catch block and handle exceptions by logging them. The Dead Letter processing design <a href="https://cloud.google.com/blog/big-data/2017/06/guide-to-common-cloud-dataflow-use-case-patterns-part-1#pattern-dealing-with-bad-data">pattern</a> recommends, in addition to logging, redirecting such bad records into a Side Output and then storing them in a database that supports fast writes, for example Bigtable, Datastore or BigQuery. It then recommends periodically reviewing the contents of the Dead Letter tables and debugging these bad records to determine the cause of the exceptions. Usually, a change in business logic to cover the additional edge case, or an improvement in data collection, will be needed to address bad records, but at the very least we are not ignoring these errors, and we preserve data that would otherwise have been lost. 
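</p><p>As a rough illustration of the Dead Letter idea, the sketch below routes records that fail processing into a separate list instead of only logging them. It is plain Java rather than Beam code, and everything in it is hypothetical: the class DeadLetterSketch, the index() method, and the two lists that stand in for the main output and the side output. The "no letters or digits" check is only a stand-in for an NLP library rejecting its input.</p><pre><code>import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration of the Dead Letter pattern; not the pipeline's actual code. */
public class DeadLetterSketch {

  // Stands in for the main output of the processing step.
  final List&lt;String&gt; indexed = new ArrayList&lt;&gt;();
  // Stands in for the side output that would be written to a dead letter table.
  final List&lt;String&gt; deadLetters = new ArrayList&lt;&gt;();

  /** Pretend indexer: rejects text that contains no letters or digits. */
  static String index(String text) {
    for (char c : text.toCharArray()) {
      if (Character.isLetterOrDigit(c)) {
        return text.trim().toLowerCase(Locale.ROOT);
      }
    }
    throw new IllegalArgumentException("no indexable text");
  }

  /** Processes one record; failures are kept with their reason instead of being dropped. */
  void process(String text) {
    try {
      indexed.add(index(text));
    } catch (IllegalArgumentException e) {
      deadLetters.add(e.getMessage() + ": " + text);
    }
  }

  public static void main(String[] args) {
    DeadLetterSketch sketch = new DeadLetterSketch();
    sketch.process(" Hello World ");
    sketch.process(":-) ;-)");
    System.out.println("indexed: " + sketch.indexed);
    System.out.println("dead letters: " + sketch.deadLetters);
  }
}</code></pre><p>Keeping the failure reason next to the raw record is what makes the later review step practical: each dead letter can be replayed once the edge case is understood.</p><p>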
In our pipeline we implemented a side output in the Indexing step and a Bigtable sink for this side output.</p><p>We&#8217;ve talked about how the Indexing step of the pipeline finds topics of documents by using the Sirocco opinion extraction library. Sirocco is not the only way to get topics of documents. Cloud NLP provides an API for classifying a text document into ~700 categories. Additionally, it provides mechanisms for extracting entities from text, which is very similar to what Sirocco does. We combined entity extraction by Cloud NLP with entity extraction by Sirocco in our pipeline to benefit from a richer classification of text, and to do that we applied another <a href="https://cloud.google.com/blog/big-data/2017/06/guide-to-common-cloud-dataflow-use-case-patterns-part-1#pattern-calling-external-services-for-data-enrichment">pattern</a> from the pattern blog posts.</p><p>After the opinion extraction in the Indexing step, the Cloud NLP enrichment, and the final filtering steps, the data is finally ready to be inserted into BigQuery. Note that the Reddit and GDELT data end up in distinct datasets; however, the schemas of these datasets are the same, and we can join the news articles in GDELT with the Reddit posts via the URL field of the news post.</p><p>This concludes the first part of our blog series, focused on preparing the data for model training. We have extracted features from our GDELT article dataset such as the entities found in the article, the author of the news post, publication date and time, and the sentiment. We also have our training labels extracted from the Reddit posts and comments, including the submitter of the Reddit post, the subreddit, and the sentiment of comments. 
We have cleansed our training data of duplicates, and in the process gained useful information about reposts, which will help us provide a more accurate estimate of the social impact.</p><p><strong>Next steps</strong></p><p>In the next set of blog posts we will introduce our methodology for developing our TensorFlow model. We hope you will be able to apply the Dataflow best practices in your own data processing pipelines. Here are a few useful links if you want to learn more.</p><p><a href="https://cloud.google.com/blog/big-data/2017/06/guide-to-common-cloud-dataflow-use-case-patterns-part-1">Guide to common Cloud Dataflow use-case patterns, Part 1</a></p><p><a href="https://cloud.google.com/blog/big-data/2017/08/guide-to-common-cloud-dataflow-use-case-patterns-part-2">Guide to common Cloud Dataflow use-case patterns, Part 2</a></p><p><a href="https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis">Dataflow Opinion Analysis github repo</a></p>]]></content:encoded></item><item><title><![CDATA[Million Dollar Idea: Order Tracking at Starbucks]]></title><description><![CDATA[Uber lessened the anxiety of cab callers by providing time estimates and position of cars on the map.]]></description><link>https://www.datancoff.ee/p/million-dollar-idea-order-tracking-at-starbucks</link><guid isPermaLink="false">https://www.datancoff.ee/p/million-dollar-idea-order-tracking-at-starbucks</guid><dc:creator><![CDATA[Serhii Sokolenko]]></dc:creator><pubDate>Sat, 02 Sep 2017 19:17:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dhJG!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb008d0-9f51-45d9-a807-90edecf46ffd_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Uber lessened the anxiety of cab callers by providing time estimates and position of cars on the map. 
Standing in an early-morning Starbucks line waiting for my Americano, I thought: &#8220;What if their app showed me the position of my order in their &#8216;assembly line&#8217;? It would make me less anxious and less tempted to ask where my order was.&#8221; They could do it by putting barcodes or RFID tags on the cups.</p>]]></content:encoded></item><item><title><![CDATA[How to speed up your BigQuery query 31x by replacing a self-join with two UNNEST() operations]]></title><description><![CDATA[One of my trend calculation queries in the Opinion Analysis project started causing trouble recently.]]></description><link>https://www.datancoff.ee/p/how-to-speed-up-your-bigquery-query-31x-by-replacing-a-self-join-with-two-unnest-operations</link><guid isPermaLink="false">https://www.datancoff.ee/p/how-to-speed-up-your-bigquery-query-31x-by-replacing-a-self-join-with-two-unnest-operations</guid><dc:creator><![CDATA[Serhii Sokolenko]]></dc:creator><pubDate>Sun, 09 Jul 2017 18:54:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dhJG!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb008d0-9f51-45d9-a807-90edecf46ffd_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of my trend calculation <a href="https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis/blob/master/src/main/java/com/google/cloud/dataflow/examples/opinionanalysis/StatsCalcPipelineUtils.java">queries</a> in the <a href="https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis">Opinion Analysis project</a> started causing trouble recently. It would run for 270 seconds and fail with the error &#8220;Query exceeded resource limits for tier 1. 
Tier 8 or higher required.&#8221; Changing the billing tier to the recommended Tier 8 did help, but the query still took 380 seconds to complete.<br>After some research I found out that the self-joins I use in part of the query were the culprit. Take a look at the join condition at the bottom of this query (in the CalcStatCombiTopics temp table):</p><pre><code>INSERT INTO opinions.stattopic (...)
WITH 
p AS (
 SELECT 20170630 AS SnapshotDateId
),
CalcStatSentiments AS (
 SELECT p.SnapshotDateId, t.Tag, ... s.SentimentHash,...
 FROM opinions.document d, p
 INNER JOIN opinions.sentiment s ON s.DocumentHash = d.DocumentHash, UNNEST(s.Tags) AS t
 INNER JOIN opinions.webresource wrOrig ON wrOrig.DocumentHash = d.DocumentHash
 INNER JOIN opinions.webresource wrRepost ON wrRepost.DocumentCollectionId = d.DocumentCollectionId
 AND wrRepost.CollectionItemId = d.CollectionItemId
 WHERE
 d.PublicationDateId = p.SnapshotDateId AND s.SentimentTotalScore &gt; 0
),
CalcStatTopics AS (
&#8230;
),
CalcStatCombiTopics AS (
 SELECT 
 css1.SnapshotDateId, CONCAT(css1.Tag,' &amp; ',css2.Tag) AS Topic, [css1.Tag,css2.Tag] AS Tags, true AS GoodAsTopic, 2 AS TagCount,
 ...
 FROM
 CalcStatSentiments css1, CalcStatSentiments css2
 WHERE
 css1.SentimentHash = css2.SentimentHash AND
 css1.Tag &lt; css2.Tag
 GROUP BY css1.SnapshotDateId, css1.Tag, css2.Tag
),
&#8230;</code></pre><p>Basically, I have a table &#8220;opinions.sentiment&#8221; with an identity column SentimentHash, a repeated field Tags, and a bunch of other columns. The Tags column contains an array of textual tags that I extract from text using the Opinion Analysis <a href="https://cloud.google.com/dataflow">Cloud Dataflow</a> <a href="https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis/blob/master/src/main/java/com/google/cloud/dataflow/examples/opinionanalysis/IndexerPipeline.java">IndexerPipeline</a>. In the pre-BigQuery days I would have used a separate table to store the tags and link it to the main Sentiment table via the identity column SentimentHash. However, in BigQuery this is much more easily done with repeated fields.</p><p>When I calculate trends, I build frequency statistics for combinations of tags (e.g. how many news articles were about both &#8220;climate change&#8221; and &#8220;G-20&#8221;). To do so, I flatten the Tags field and create a temporary table CalcStatSentiments that contains individual records for each tag, with fields like SentimentHash as well as the actual tag. I then do a self-join of the CalcStatSentiments table to build what I call &#8220;topics&#8221; (in the CalcStatCombiTopics temp table).</p><p>It turns out that self-joins are bad for (your health) the performance of your queries, as illustrated in this <a href="http://www.lunametrics.com/blog/2016/05/12/self-joins-windowing-user-defined-functions-bigquery/">blog post</a>. It suggests replacing the self-joins with Windowing. 
I considered doing that, but Windowing works best for aggregate statistics, whereas I needed to end up with combinations of tags, so I came up with another technique.</p><p><strong>Announcing Sergei&#8217;s Self-Join Elimination Technique &#169;</strong></p><p>Instead of flattening my Tags field only once (in the CalcStatSentiments part of my query), <em><strong>I now flatten it once, carry the Tags array along in the result set, and then flatten it a second time to emulate the CROSS JOIN operation.</strong></em></p><p>Here is how it looks in the new query:</p><pre><code>WITH 
p AS (
 SELECT 20170630 AS SnapshotDateId
),
SentimentTags AS (
  SELECT p.SnapshotDateId, s.SentimentHash, t.Tag, t.GoodAsTopic, s.Tags AS Tags
  FROM p, opinions.sentiment s, UNNEST(s.Tags) AS t
  WHERE
    s.DocumentDateId = p.SnapshotDateId AND s.SentimentTotalScore &gt; 0
),
SentimentTagCombos AS (
  SELECT st.SnapshotDateId, st.SentimentHash, st.Tag AS Tag1, stt.Tag AS Tag2 
  FROM SentimentTags st, UNNEST(st.Tags) stt
  WHERE st.Tag &lt; stt.Tag
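  -- Illustration with made-up values (not taken from the real dataset):
  -- a sentiment whose Tags array holds the tags A, B and C produces three
  -- SentimentTags rows (Tag = A, B, C), each carrying the full array.
  -- The second UNNEST pairs every row with all three array elements (9 pairs),
  -- and the filter above keeps only (A,B), (A,C) and (B,C).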
),
...</code></pre><p>The inequality filter <code>WHERE st.Tag &lt; stt.Tag</code> ensures that each pair of tags appears only once in my tag combos and that a tag is never paired with itself. It works the same way as the inequality filter in the original version of my query:</p><pre><code>WHERE
 css1.SentimentHash = css2.SentimentHash AND
 css1.Tag &lt; css2.Tag</code></pre><p>Once I have calculated my Tag1 &amp; Tag2 combinations, I join the result set to my main dataset via the SentimentHash record ID and complete the remaining calculations.</p><p>The result: instead of a query that takes 380 seconds to complete in billing tier 8, my modified query runs in 12 seconds in billing tier 1 (about 31x faster).</p><p>Conclusion:</p><p>Using self-joins = BAD.&nbsp;<br>Replacing them with dual UNNEST() = Priceless! (or something like that)</p><p>Here are the queries if you want to compare the syntax.</p><p>New, performance-optimized query:</p><pre><code>INSERT INTO opinions.stattopic (SnapshotDateId, Topic, Tags, TagCount, cntOrigPublishers, cntRepostWRs,
  cntPositives, cntNegatives, cntAmbiguous, cntGeneral, SentimentHashes, OrigWebResourceHashes, RepostWebResourceHashes )
WITH 
p AS (
 SELECT 20170630 AS SnapshotDateId
),
SentimentTags AS (
  SELECT p.SnapshotDateId, s.SentimentHash, t.Tag, t.GoodAsTopic, s.Tags AS Tags
  FROM p, opinions.sentiment s, UNNEST(s.Tags) AS t
  WHERE
    s.DocumentDateId = p.SnapshotDateId AND s.SentimentTotalScore &gt; 0
),
SentimentTagCombos AS (
  SELECT st.SnapshotDateId, st.SentimentHash, st.Tag AS Tag1, stt.Tag AS Tag2 
  FROM SentimentTags st, UNNEST(st.Tags) stt
  WHERE st.Tag &lt; stt.Tag
),
CalcStatSentiments AS (
  SELECT st.SnapshotDateId, st.Tag, st.GoodAsTopic, d.DocumentHash AS DocumentHash, s.SentimentHash,
    wrOrig.WebResourceHash AS OrigWebResourceHash, wrOrig.Domain AS OrigDomain, wrRepost.WebResourceHash AS RepostWebResourceHash,
    s.DominantValence AS Valence, d.PublicationTime AS PublicationTime
  FROM SentimentTags st
    INNER JOIN opinions.sentiment s ON s.SentimentHash = st.SentimentHash AND s.DocumentDateId = st.SnapshotDateId
    INNER JOIN opinions.document d ON d.DocumentHash = s.DocumentHash AND d.PublicationDateId = st.SnapshotDateId
    INNER JOIN opinions.webresource wrOrig ON wrOrig.DocumentHash = d.DocumentHash
    INNER JOIN opinions.webresource wrRepost ON wrRepost.DocumentCollectionId = d.DocumentCollectionId
      AND wrRepost.CollectionItemId = d.CollectionItemId
),
CalcStatTopics AS (
  SELECT
    c.SnapshotDateId, c.Tag AS Topic, [c.Tag] AS Tags, c.GoodAsTopic, 1 AS TagCount,
    COUNT(distinct OrigDomain) as cntOrigPublishers,
    COUNT(distinct RepostWebResourceHash) as cntRepostWRs,
    COUNT(distinct (case when c.Valence=1 then c.SentimentHash else null end)) as cntPositives,
    COUNT(distinct (case when c.Valence=2 then c.SentimentHash else null end)) as cntNegatives,
    COUNT(distinct (case when c.Valence=3 then c.SentimentHash else null end)) as cntAmbiguous,
    COUNT(distinct (case when c.Valence=5 then c.SentimentHash else null end)) as cntGeneral,
    ARRAY_AGG(DISTINCT c.SentimentHash) AS SentimentHashes,
    ARRAY_AGG(DISTINCT c.OrigWebResourceHash) AS OrigWebResourceHashes,
    ARRAY_AGG(DISTINCT c.RepostWebResourceHash) AS RepostWebResourceHashes
  FROM CalcStatSentiments c
  GROUP BY c.SnapshotDateId, c.Tag, c.GoodAsTopic
),
CalcStatCombiTopics AS (
  SELECT 
    stc.SnapshotDateId, CONCAT(stc.Tag1,' &amp; ',stc.Tag2) AS Topic, [stc.Tag1,stc.Tag2] AS Tags, true AS GoodAsTopic, 2 AS TagCount,
    COUNT(distinct wrOrig.Domain) as cntOrigPublishers,
    COUNT(distinct wrRepost.WebResourceHash) as cntRepostWRs,
    COUNT(distinct (case when s.DominantValence=1 then s.SentimentHash else null end)) as cntPositives,
    COUNT(distinct (case when s.DominantValence=2 then s.SentimentHash else null end)) as cntNegatives,
    COUNT(distinct (case when s.DominantValence=3 then s.SentimentHash else null end)) as cntAmbiguous,
    COUNT(distinct (case when s.DominantValence=5 then s.SentimentHash else null end)) as cntGeneral,
    ARRAY_AGG(DISTINCT s.SentimentHash) AS SentimentHashes,
    ARRAY_AGG(DISTINCT wrOrig.WebResourceHash) AS OrigWebResourceHashes,
    ARRAY_AGG(DISTINCT wrRepost.WebResourceHash) AS RepostWebResourceHashes
  FROM SentimentTagCombos stc
    INNER JOIN opinions.sentiment s ON s.SentimentHash = stc.SentimentHash AND s.DocumentDateId = stc.SnapshotDateId
    INNER JOIN opinions.document d ON d.DocumentHash = s.DocumentHash AND d.PublicationDateId = stc.SnapshotDateId
    INNER JOIN opinions.webresource wrOrig ON wrOrig.DocumentHash = d.DocumentHash
    INNER JOIN opinions.webresource wrRepost ON wrRepost.DocumentCollectionId = d.DocumentCollectionId
      AND wrRepost.CollectionItemId = d.CollectionItemId
  GROUP BY stc.SnapshotDateId, stc.Tag1, stc.Tag2
  -- HAVING cntPublisherDomains &gt; 1
),
CalcStatAllTopics AS (
  SELECT * FROM CalcStatTopics
  WHERE GoodAsTopic = true AND cntRepostWRs &gt; 1
  UNION ALL
  SELECT * FROM CalcStatCombiTopics
  WHERE GoodAsTopic = true AND cntRepostWRs &gt; 1
)
SELECT SnapshotDateId, Topic, Tags, TagCount, cntOrigPublishers, cntRepostWRs,
  cntPositives, cntNegatives, cntAmbiguous, cntGeneral, SentimentHashes, OrigWebResourceHashes, RepostWebResourceHashes
FROM CalcStatAllTopics</code></pre><p>Original Query:</p><pre><code>INSERT INTO opinions.stattopic (SnapshotDateId, Topic, Tags, TagCount, cntOrigPublishers, cntRepostWRs,
  cntPositives, cntNegatives, cntAmbiguous, cntGeneral, SentimentHashes, OrigWebResourceHashes, RepostWebResourceHashes )
WITH 
p AS (
 SELECT 20170630 AS SnapshotDateId
),
CalcStatSentiments AS (
  SELECT p.SnapshotDateId, t.Tag, t.GoodAsTopic, d.DocumentHash AS DocumentHash, s.SentimentHash,
    wrOrig.WebResourceHash AS OrigWebResourceHash, wrOrig.Domain AS OrigDomain, wrRepost.WebResourceHash AS RepostWebResourceHash,
    s.DominantValence AS Valence, d.PublicationTime AS PublicationTime
  FROM opinions.document d, p
    INNER JOIN opinions.sentiment s ON s.DocumentHash = d.DocumentHash, UNNEST(s.Tags) AS t
    -- Need to use Sentiment tags, so that sentiments relate to topics
    INNER JOIN opinions.webresource wrOrig ON wrOrig.DocumentHash = d.DocumentHash
    INNER JOIN opinions.webresource wrRepost ON wrRepost.DocumentCollectionId = d.DocumentCollectionId
      AND wrRepost.CollectionItemId = d.CollectionItemId
  WHERE
    d.PublicationDateId = p.SnapshotDateId AND s.SentimentTotalScore &gt; 0
),
CalcStatTopics AS (
  SELECT
    c.SnapshotDateId, c.Tag AS Topic, [c.Tag] AS Tags, c.GoodAsTopic, 1 AS TagCount,
    COUNT(distinct OrigDomain) as cntOrigPublishers,
    COUNT(distinct RepostWebResourceHash) as cntRepostWRs,
    COUNT(distinct (case when c.Valence=1 then c.SentimentHash else null end)) as cntPositives,
    COUNT(distinct (case when c.Valence=2 then c.SentimentHash else null end)) as cntNegatives,
    COUNT(distinct (case when c.Valence=3 then c.SentimentHash else null end)) as cntAmbiguous,
    COUNT(distinct (case when c.Valence=5 then c.SentimentHash else null end)) as cntGeneral,
    ARRAY_AGG(DISTINCT c.SentimentHash) AS SentimentHashes,
    ARRAY_AGG(DISTINCT c.OrigWebResourceHash) AS OrigWebResourceHashes,
    ARRAY_AGG(DISTINCT c.RepostWebResourceHash) AS RepostWebResourceHashes
  FROM CalcStatSentiments c
  GROUP BY c.SnapshotDateId, c.Tag, c.GoodAsTopic
),
CalcStatCombiTopics AS (
  SELECT 
    css1.SnapshotDateId, CONCAT(css1.Tag,' &amp; ',css2.Tag) AS Topic, [css1.Tag,css2.Tag] AS Tags, true AS GoodAsTopic, 2 AS TagCount,
    COUNT(distinct css1.OrigDomain) as cntOrigPublishers,
    COUNT(distinct css1.RepostWebResourceHash) as cntRepostWRs,
    COUNT(distinct (case when css1.Valence=1 then css1.SentimentHash else null end)) as cntPositives,
    COUNT(distinct (case when css1.Valence=2 then css1.SentimentHash else null end)) as cntNegatives,
    COUNT(distinct (case when css1.Valence=3 then css1.SentimentHash else null end)) as cntAmbiguous,
    COUNT(distinct (case when css1.Valence=5 then css1.SentimentHash else null end)) as cntGeneral,
    ARRAY_AGG(DISTINCT css1.SentimentHash) AS SentimentHashes,
    ARRAY_AGG(DISTINCT css1.OrigWebResourceHash) AS OrigWebResourceHashes,
    ARRAY_AGG(DISTINCT css1.RepostWebResourceHash) AS RepostWebResourceHashes
  FROM
    CalcStatSentiments css1, CalcStatSentiments css2
  WHERE
    css1.SentimentHash = css2.SentimentHash AND
    css1.Tag &lt; css2.Tag
  GROUP BY css1.SnapshotDateId, css1.Tag, css2.Tag
  -- HAVING cntPublisherDomains &gt; 1
),
CalcStatAllTopics AS (
  SELECT * FROM CalcStatTopics
  WHERE GoodAsTopic = true AND cntRepostWRs &gt; 1
  UNION ALL
  SELECT * FROM CalcStatCombiTopics
  WHERE GoodAsTopic = true AND cntRepostWRs &gt; 1
)
SELECT SnapshotDateId, Topic, Tags, TagCount, cntOrigPublishers, cntRepostWRs,
  cntPositives, cntNegatives, cntAmbiguous, cntGeneral, SentimentHashes, OrigWebResourceHashes, RepostWebResourceHashes
FROM CalcStatAllTopics</code></pre>]]></content:encoded></item><item><title><![CDATA[Serverless ETL for Sirocco on Google Cloud]]></title><description><![CDATA[I published an architecture of a serverless ETL solution for Sirocco on the GCP Big Data blog.]]></description><link>https://www.datancoff.ee/p/serverless-etl-for-sirocco-on-google-cloud</link><guid isPermaLink="false">https://www.datancoff.ee/p/serverless-etl-for-sirocco-on-google-cloud</guid><dc:creator><![CDATA[Serhii Sokolenko]]></dc:creator><pubDate>Thu, 11 May 2017 19:00:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dhJG!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb008d0-9f51-45d9-a807-90edecf46ffd_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I published an architecture of a <a href="https://cloud.google.com/blog/big-data/2017/05/designing-etl-architecture-for-a-cloud-native-data-warehouse-on-google-cloud-platform">serverless ETL solution for Sirocco</a> on the GCP Big Data blog. This solution scales from a few news articles in a cloud bucket to millions of news posts in a database, taking advantage of Cloud Dataflow&#8217;s autoscaling features. With this post you should now have all the components for building a news monitoring or opinion tracking solution. 
I know this because I am using exactly the same setup for an actual news monitoring solution (more about it in a future post).</p><p>Here is what I suggest you do:</p><ul><li><p>Read about <a href="https://medium.com/@datancoffee/opinion-analysis-of-text-using-plutchik-5119a80229ea">Plutchik&#8217;s framework</a> for emotion analysis to understand the theory behind this solution</p></li><li><p>Read about the <a href="https://cloud.google.com/blog/big-data/2017/05/designing-etl-architecture-for-a-cloud-native-data-warehouse-on-google-cloud-platform">ETL solution</a></p></li><li><p>Go to the <a href="https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis">GitHub repo</a> and follow the instructions in the README. Set up your own processing pipeline and run a test crunching a few news articles that I uploaded to the test folder.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Opinion Analysis of Text using Plutchik]]></title><description><![CDATA[I got inspired to write this blog by a post I saw today on the French presidential election.]]></description><link>https://www.datancoff.ee/p/opinion-analysis-of-text-using-plutchik</link><guid isPermaLink="false">https://www.datancoff.ee/p/opinion-analysis-of-text-using-plutchik</guid><dc:creator><![CDATA[Serhii Sokolenko]]></dc:creator><pubDate>Thu, 04 May 2017 19:02:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/06c31a74-c178-4fb6-874b-4804d6c43e63_800x600.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I got inspired to write this blog by a <a href="http://odintext.com/blog/french-election-upset/">post</a> I saw today on the French presidential election. Plutchik is really the strongest framework I know for sentiment analysis that goes beyond pedestrian positive/negative classification of sentiment. It was good to see that it was being adopted by others in the industry. 
So, here we go.</p><p>To recap some of my earlier posts, <a href="https://github.com/datancoffee/sirocco">Sirocco</a> parses each news article or user-generated content into subjects and opinions. The subject and the opinion, together with the author of the news article, form a triad that allows us to answer the question: &#8220;Who (author) thinks about what (the subject) in what way (the opinion)?&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ymMB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4adc86-c702-4424-8102-bd63e7fea842_800x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ymMB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4adc86-c702-4424-8102-bd63e7fea842_800x600.png 424w, https://substackcdn.com/image/fetch/$s_!ymMB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4adc86-c702-4424-8102-bd63e7fea842_800x600.png 848w, https://substackcdn.com/image/fetch/$s_!ymMB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4adc86-c702-4424-8102-bd63e7fea842_800x600.png 1272w, https://substackcdn.com/image/fetch/$s_!ymMB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4adc86-c702-4424-8102-bd63e7fea842_800x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ymMB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4adc86-c702-4424-8102-bd63e7fea842_800x600.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da4adc86-c702-4424-8102-bd63e7fea842_800x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ymMB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4adc86-c702-4424-8102-bd63e7fea842_800x600.png 424w, https://substackcdn.com/image/fetch/$s_!ymMB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4adc86-c702-4424-8102-bd63e7fea842_800x600.png 848w, https://substackcdn.com/image/fetch/$s_!ymMB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4adc86-c702-4424-8102-bd63e7fea842_800x600.png 1272w, https://substackcdn.com/image/fetch/$s_!ymMB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda4adc86-c702-4424-8102-bd63e7fea842_800x600.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">Author-Opinion-Subject Triad</figcaption></figure></div><p>Finding out who authored a news piece tends to be straightforward. 
The author&#8217;s name is either in the byline of the article, in which case it can be parsed out of the text, or in the metadata of the record that is part of an imported dataset.</p><p>We think of news articles as collections of opinions of various degrees of factuality. The individual opinions can be in a single sentence or in a sequence of sentences, and they are often, but not always, separated by paragraphs. Sirocco analyzes individual sentences and then &#8220;chunks&#8221; related sentences together to form an &#8220;opinion&#8221;.</p><p>To determine the subjects of a sentence, we take the sentence <a href="https://en.wikipedia.org/wiki/Parse_tree">parse tree</a> (the constituency-based parse tree, to be precise) and extract Proper Nouns (aka Named Entities) and Noun Phrases. Named Entities are names of persons, organizations, locations, expressions of time, quantities, etc. Typically, they can be found by looking at the first letter of a word (if it is capitalized, and in the middle of a sentence, then it&#8217;s a good sign that it&#8217;s a Named Entity). There are many more rules for when a combination of words represents a name, but that&#8217;s why we have machine learning and training data sets, and the quality of Named Entity extraction has reached pretty high levels. In the terminology of the Penn Treebank Project, Named Entities are Proper Nouns, carrying the <a href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">NNP part-of-speech tag</a>.</p><p>Noun Phrases are 1-to-N-word phrases of nouns, adjectives, etc. For example, &#8220;missile test&#8221; and &#8220;air defense force&#8221; are both noun phrases. The NLP algorithms group words together into Noun Phrases based on their role in the sentence and on patterns observed in the training data sets. 
It&#8217;s helpful to process both Named Entities and Noun Phrases as potential subjects, because in many cases a sentence won&#8217;t have a name mentioned in it, but it will usually have a Noun Phrase.</p><p>Sentiment or opinion extraction is where the bulk of the work is done. Opinions can be roughly divided into emotions (&#8220;very interesting&#8221;) and expressions of qualities (&#8220;nice&#8221;, &#8220;big&#8221;). The Sirocco emotion detection technology uses a framework for human emotions originally developed by <a href="https://en.wikipedia.org/wiki/Robert_Plutchik">Robert Plutchik</a>, a professor at the Albert Einstein College of Medicine.</p><p>Plutchik&#8217;s <a href="https://en.wikipedia.org/wiki/Contrasting_and_categorization_of_emotions#Plutchik.27s_wheel_of_emotions">Wheel of Emotions</a> identifies 8 basic emotions: Joy, Trust (also called Acceptance), Fear, Surprise, Sadness, Disgust, Anger, and Anticipation.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7vrC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591bdce3-f307-40c2-b020-2bf669a6d0a2_715x725.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7vrC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591bdce3-f307-40c2-b020-2bf669a6d0a2_715x725.png 424w, https://substackcdn.com/image/fetch/$s_!7vrC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591bdce3-f307-40c2-b020-2bf669a6d0a2_715x725.png 848w, 
https://substackcdn.com/image/fetch/$s_!7vrC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591bdce3-f307-40c2-b020-2bf669a6d0a2_715x725.png 1272w, https://substackcdn.com/image/fetch/$s_!7vrC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591bdce3-f307-40c2-b020-2bf669a6d0a2_715x725.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7vrC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591bdce3-f307-40c2-b020-2bf669a6d0a2_715x725.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/591bdce3-f307-40c2-b020-2bf669a6d0a2_715x725.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!7vrC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591bdce3-f307-40c2-b020-2bf669a6d0a2_715x725.png 424w, https://substackcdn.com/image/fetch/$s_!7vrC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591bdce3-f307-40c2-b020-2bf669a6d0a2_715x725.png 848w, 
https://substackcdn.com/image/fetch/$s_!7vrC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591bdce3-f307-40c2-b020-2bf669a6d0a2_715x725.png 1272w, https://substackcdn.com/image/fetch/$s_!7vrC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F591bdce3-f307-40c2-b020-2bf669a6d0a2_715x725.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Plutchik&#8217;s Wheel of Emotions. Source: Wikipedia (released into the public&nbsp;domain)</figcaption></figure></div><p>The 8 basic emotions form 4 pairs of opposites: Fear is the opposite of Anger, Disgust is the opposite of Trust, Sadness is the opposite of Joy, and Surprise is the opposite of Anticipation. Emotional intensity can vary: taking Sadness as an example, the high-intensity extreme is Grief, while low-intensity Sadness is Pensiveness.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Icd8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3931e196-7b1f-417f-acd9-111eda06ead3_800x223.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Icd8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3931e196-7b1f-417f-acd9-111eda06ead3_800x223.png 424w, https://substackcdn.com/image/fetch/$s_!Icd8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3931e196-7b1f-417f-acd9-111eda06ead3_800x223.png 848w, 
https://substackcdn.com/image/fetch/$s_!Icd8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3931e196-7b1f-417f-acd9-111eda06ead3_800x223.png 1272w, https://substackcdn.com/image/fetch/$s_!Icd8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3931e196-7b1f-417f-acd9-111eda06ead3_800x223.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Icd8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3931e196-7b1f-417f-acd9-111eda06ead3_800x223.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3931e196-7b1f-417f-acd9-111eda06ead3_800x223.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Icd8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3931e196-7b1f-417f-acd9-111eda06ead3_800x223.png 424w, https://substackcdn.com/image/fetch/$s_!Icd8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3931e196-7b1f-417f-acd9-111eda06ead3_800x223.png 848w, 
https://substackcdn.com/image/fetch/$s_!Icd8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3931e196-7b1f-417f-acd9-111eda06ead3_800x223.png 1272w, https://substackcdn.com/image/fetch/$s_!Icd8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3931e196-7b1f-417f-acd9-111eda06ead3_800x223.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Plutchik emotion&nbsp;pairs</figcaption></figure></div><p>More complex human emotions, according to Plutchik&#8217;s research, can be thought of as combinations of two basic ones. For example, Love is Joy + Trust, while Pessimism is Sadness + Anticipation.</p><p>Plutchik&#8217;s emotion theory is not the only game in town for sentiment analysis; there are other emotion frameworks used by psychologists and researchers. However, Plutchik has something that other frameworks lack: it is very well suited to algorithmic implementation. As we have 4 pairs of emotions, we can use relatively simple arithmetic for processing negations. The composability of derived emotions allows us to build search patterns that consist of simple AND operations. Lastly, we can use the emotional intensity gradings (e.g. ranging from Pensiveness to Sadness to Grief on the Sadness emotional dimension) to process degree adverbs, e.g. &#8220;very sad&#8221; is Grief.</p><p>When a computer program attempts to understand the emotions people try to express in text, it faces dozens of signals that indicate the presence of emotions. 
In addition to adjectives and nouns such as &#8220;cranky&#8221; or &#8220;merry&#8221; representing emotions, strong signals are sent by linguistic constructs such as idioms, adjective phrases, adverbs and others, including:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!swTf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae4b25d1-0567-493e-bc21-a50176219575_800x414.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!swTf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae4b25d1-0567-493e-bc21-a50176219575_800x414.png 424w, https://substackcdn.com/image/fetch/$s_!swTf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae4b25d1-0567-493e-bc21-a50176219575_800x414.png 848w, https://substackcdn.com/image/fetch/$s_!swTf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae4b25d1-0567-493e-bc21-a50176219575_800x414.png 1272w, https://substackcdn.com/image/fetch/$s_!swTf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae4b25d1-0567-493e-bc21-a50176219575_800x414.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!swTf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae4b25d1-0567-493e-bc21-a50176219575_800x414.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae4b25d1-0567-493e-bc21-a50176219575_800x414.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!swTf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae4b25d1-0567-493e-bc21-a50176219575_800x414.png 424w, https://substackcdn.com/image/fetch/$s_!swTf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae4b25d1-0567-493e-bc21-a50176219575_800x414.png 848w, https://substackcdn.com/image/fetch/$s_!swTf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae4b25d1-0567-493e-bc21-a50176219575_800x414.png 1272w, https://substackcdn.com/image/fetch/$s_!swTf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae4b25d1-0567-493e-bc21-a50176219575_800x414.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Signals of opinions in&nbsp;text</figcaption></figure></div><p>The ability to process abbreviations, interjections, capitalization and emoticons is very handy for dealing with informal text including tweets and blog comments. 
In fact, when processing informal text so common on Twitter, the goals of sentiment analysis are a bit different from those for the fully formed sentences of a WSJ editorial. Building sentence trees is almost useless for tweets. There, after we tokenize sentences into words and run part-of-speech tagging, we stop at the chunking step and do not proceed to parsing at all.</p><p>The flat, sequential nature of written text also imposes some unique challenges for a computer algorithm. For example, should emotion signals in quotes be associated with the author of the news post that is being analyzed? (In our opinion, they should not.) How should we deal with emotions in question sentences? (The person asking the question seldom carries that emotion.) What about the favorite example in every sentiment analysis textbook: irony? The words say one thing, while both the reader and the writer know they mean something else, based on the context of the previous conversation. To handle questions, quotes, and irony, we introduced a special sentiment dimension and called it &#8220;Ambiguous&#8221;. &#8220;Ambiguous&#8221; sentiment is a catch-all sentiment that we have to use when we find phrases that would normally be one of the 8 basic emotions (according to Plutchik&#8217;s theory), but are located in quotes or in question sentences or are possibly ironic.</p><p>So far we have talked about emotions in text, but there are other interesting signals in text that could be considered opinions: word phrases representing qualities, such as &#8220;correct&#8221; or &#8220;elegant&#8221;. Expressions of quality cannot typically be mapped to one of Plutchik&#8217;s emotions, but they do usually carry a positive or negative connotation. Sometimes quality words can have different meanings depending on the context, e.g. &#8220;massive flood&#8221; is negative, while &#8220;massive investment&#8221; is positive. 
We use our idiom dictionary to determine the semantic orientation, but when we can&#8217;t find a known idiom, we again use the Ambiguous dimension to flag such a phrase occurrence.</p><p>Lastly, when processing text sourced from the Internet, especially from non-moderated sites, one comes across profanity that one would not want to republish or repost in a research paper or site. We&#8217;ve added the ability to recognize several thousand words and idioms that are considered racist, pornographic, or profane, and flag such occurrences under the &#8220;profane&#8221; dimension. We also added the &#8220;unsafe&#8221; sentiment dimension to flag phrases that refer to sexuality, drug use, criminal activity and other potentially delicate topics.</p><p>Let&#8217;s also talk about what happens to the entire news article once we have processed all constituent sentences of that article. Knowing the subjects of each sentence lets us calculate, for the entire article, which subjects are the most important, taking into account the frequency of their occurrence and their placement in sentences with strong sentiment. Our algorithm produces up to 7 tags to classify the topic of the article (we borrowed the term Tag from the social media context). The tags, together with all the opinions we extracted from the article, form what we call a Content Index of the article. 
Here is an example of a Content Index for an <a href="http://www.hindustantimes.com/football/napoli-benfica-face-showdown-for-uefa-champions-league-last-16-spot/story-7WS3BRAGzm8ssh1EIKkUzJ.html">article</a> reporting on a UEFA soccer game between the Besiktas and Napoli soccer clubs.</p><pre><code>TEXT STATS Length [3408] Paragraphs [18] Unique Entities [118]</code></pre><pre><code>TAGS Begin:</code></pre><pre><code>8.5 "Besiktas" [Good Topic]</code></pre><pre><code>6.0 "14-time national champion refusal" [Good Topic]</code></pre><pre><code>5.9 "Liverpool historic 2005 Istanbul triumph" [Good Topic]</code></pre><pre><code>5.5 "Napoli" [Good Topic]</code></pre><pre><code>4.9 "quarter-final appearance" [Good Topic]</code></pre><pre><code>4.7 "runners-up Bayern Munich" [Good Topic]</code></pre><pre><code>4.6 "Turkish" [Good Topic]</code></pre><pre><code>TOP SENTIMENTS Begin:</code></pre><pre><code>Sentiment {0} Tags: Besiktas, quarter-final appearance</code></pre><pre><code>Sentiment {0} Dominant Valence: positive</code></pre><pre><code>Sentiment {0} Total Sentiment Score: 34.0</code></pre><pre><code>Sentiment {0} Annotated Text: {{Besiktas}} will &lt;&lt;hope&gt;&gt; that their &lt;&lt;best&gt;&gt; previous showing&#8212; a 1987 European Cup {{quarter-final appearance}}&#8212; will not prove a &lt;&lt;poor&gt;&gt; omen.</code></pre><pre><code>Sentiment {0} Serialized Representation: V2|2|Besiktas|0|7|1|equarter-final appearance|73|96|1|e3|14|17|1|t116|119|1|n30|33|1|p</code></pre><pre><code>Sentiment {1} Tags: Besiktas, Liverpool historic 2005 Istanbul triumph, Turkish</code></pre><pre><code>Sentiment {1} Dominant Valence: positive</code></pre><pre><code>Sentiment {1} Total Sentiment Score: 30.0</code></pre><pre><code>Sentiment {1} Annotated Text: That {{Turkish}} comeback, rekindling memories of {{Liverpool's historic 2005 Istanbul &lt;&lt;triumph}&gt;, has left the group outcome on a knife-edge with {{Besiktas}} knowing that a &lt;&lt;win&gt;&gt; at 
&lt;&lt;eliminated&gt;&gt; Dynamo Kiev will send them through.</code></pre><pre><code>Sentiment {1} Serialized Representation: V2|3|Besiktas|138|145|1|eLiverpool historic 2005 Istanbul triumph|46|87|1|eTurkish|5|11|1|e3|169|178|1|n81|87|1|j162|164|1|j</code></pre><pre><code>Sentiment {2} Tags: Besiktas</code></pre><pre><code>Sentiment {2} Dominant Valence: positive</code></pre><pre><code>Sentiment {2} Total Sentiment Score: 20.0</code></pre><pre><code>Sentiment {2} Annotated Text: The winner of Benfica-Napoli in Lisbon will &lt;&lt;qualify&gt;&gt; as group winners, while a draw would be enough to send the latter through given their head-to-head edge. If {{Besiktas}} lose, both will &lt;&lt;progress&gt;&gt; regardless.</code></pre><pre><code>Sentiment {2} Serialized Representation: V2|1|Besiktas|160|167|1|e2|44|50|1|p185|192|1|p</code></pre><pre><code>Sentiment {3} Tags: Besiktas</code></pre><pre><code>Sentiment {3} Dominant Valence: negative</code></pre><pre><code>Sentiment {3} Total Sentiment Score: 20.0</code></pre><pre><code>Sentiment {3} Annotated Text: Portugal's Benfica, European champions back in 1961 and 1962 in their &lt;&lt;halcyon&gt;&gt; &lt;&lt;days&gt;&gt; but who have underachieved since, should have secured their passage on matchday five but &lt;&lt;threw&gt;&gt; away a three-goal lead at {{Besiktas}}.</code></pre><pre><code>Sentiment {3} Serialized Representation: V2|1|Besiktas|204|211|1|e3|172|176|1|s78|81|1|g70|76|1|g</code></pre><p>I hope I was able to provide you with more insight into how <a href="https://github.com/datancoffee/sirocco">Sirocco</a> works. 
As a parting thought, here is why I think the knowledge of how a person feels towards a subject matters.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y-Ag!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7399303-867d-46e1-9479-442c11830b42_800x247.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y-Ag!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7399303-867d-46e1-9479-442c11830b42_800x247.png 424w, https://substackcdn.com/image/fetch/$s_!y-Ag!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7399303-867d-46e1-9479-442c11830b42_800x247.png 848w, https://substackcdn.com/image/fetch/$s_!y-Ag!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7399303-867d-46e1-9479-442c11830b42_800x247.png 1272w, https://substackcdn.com/image/fetch/$s_!y-Ag!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7399303-867d-46e1-9479-442c11830b42_800x247.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y-Ag!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7399303-867d-46e1-9479-442c11830b42_800x247.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7399303-867d-46e1-9479-442c11830b42_800x247.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!y-Ag!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7399303-867d-46e1-9479-442c11830b42_800x247.png 424w, https://substackcdn.com/image/fetch/$s_!y-Ag!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7399303-867d-46e1-9479-442c11830b42_800x247.png 848w, https://substackcdn.com/image/fetch/$s_!y-Ag!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7399303-867d-46e1-9479-442c11830b42_800x247.png 1272w, https://substackcdn.com/image/fetch/$s_!y-Ag!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7399303-867d-46e1-9479-442c11830b42_800x247.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Commercial/practical uses of sentiment/opinion analysis of&nbsp;text</figcaption></figure></div><p>When you know that someone is anticipating or is interested in something, you can work with that person to see if you can offer them what they are looking for. 
Alternatively, when someone is experiencing sadness, disgust, or anger towards a subject, then it&#8217;s a good indicator that they are looking for an alternative. When someone is positive, or experiences joy about a subject, you can use that to positively influence other people in their circle. Conversely, when someone is negative about something, you can use that to promote the alternative.</p>]]></content:encoded></item><item><title><![CDATA[Selecting a Java WordNet API for lemma lookups]]></title><description><![CDATA[This is the part two blog post of the Sirocco &#8220;modernization&#8221; series.]]></description><link>https://www.datancoff.ee/p/selecting-a-java-wordnet-api-for-lemma-lookups</link><guid isPermaLink="false">https://www.datancoff.ee/p/selecting-a-java-wordnet-api-for-lemma-lookups</guid><dc:creator><![CDATA[Serhii Sokolenko]]></dc:creator><pubDate>Sun, 23 Apr 2017 19:05:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dhJG!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb008d0-9f51-45d9-a807-90edecf46ffd_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is the <strong>part two</strong> blog post of the Sirocco &#8220;modernization&#8221; series.</p><p>In the old, SharpNLP version of Sirocco, we used WordNet version 2.7 to look up the base forms, aka lemmas, of words. For example, if you have the verb &#8220;was&#8221;, its lemma, or base form, is &#8220;be&#8221;. Knowing base forms of words is important for us because we use pattern rules for English idioms that are defined using base forms of words. An example of an English idiom is &#8220;rub it in&#8221;, defined in Sirocco patterns as &#8220;rub/VB it/PRP in/IN&#8221;. This pattern lets us catch variations of the phrase, e.g. 
&#8220;rubbed it in&#8221;, &#8220;rubbing it in&#8221; etc.</p><p>WordNet is a large dictionary of English words (look up any word in this simple <a href="http://wordnetweb.princeton.edu/perl/webwn">web UI</a>) with Java and many other libraries to access it programmatically. Its <a href="http://wordnet.princeton.edu/wordnet/download/current-version/">latest version</a> is 3.1 and I was looking for a Java API that would let me access the newest dictionaries and run in the cloud. The second requirement had a good reason. As I was reading about the different options, I realized that the best way to accomplish efficient execution in the cloud was to have the WordNet dictionary files packaged in a jar, and loaded as resources. I could not use WordNet dictionaries deployed as databases, or as files on the file system, as this would quickly break in the auto-scaled, auto-deployed world of modern clouds.</p><p>First, I looked at OpenNLP. OpenNLP version 1.6 supports WordNet via an add-on, which in turn uses the <a href="https://mvnrepository.com/artifact/net.sf.jwordnet/jwnl/">JWNL library</a>; specifically, the add-on provides the <a href="http://svn.apache.org/repos/asf/opennlp/addons/jwnl-addon/src/main/java/opennlp/jwnl/lemmatizer/JWNLLemmatizer.java">JWNLLemmatizer</a> class. If you want to use it in your project, add the following dependency to your POM (but read to the end of this blog post before you do that).</p><pre><code>&lt;dependency&gt;
  &lt;groupId&gt;net.sf.jwordnet&lt;/groupId&gt;
  &lt;artifactId&gt;jwnl&lt;/artifactId&gt;
  &lt;version&gt;1.3.3&lt;/version&gt;
  &lt;scope&gt;compile&lt;/scope&gt;
&lt;/dependency&gt;</code></pre><p>However, the JWNLLemmatizer from the 1.6 version of OpenNLP returns only a single base form per word, even though a significant portion of English words have more than one. Interestingly, OpenNLP version 1.5.3 has the <a href="http://www.javased.com/index.php?source_dir=opennlp/opennlp-tools/src/main/java/opennlp/tools/coref/mention/JWNLDictionary.java">JWNLDictionary</a> class, which returns a list of base forms. I almost ended up using that class, but didn&#8217;t for reasons I will explain below.</p><p>After getting JWNLDictionary to work locally, I realized that I just could not deploy it to my preferred cloud provider. The OpenNLP add-on assumed it could access dictionary files directly on disk, which would not work with my data processing setup. After some research I came across this Stack Overflow <a href="http://stackoverflow.com/questions/15258950/wordnet-java-api">post</a>, which provided me with more options. I did not try every one, I admit, but I tried many. The package that did the trick was Extended JWNL, or <a href="https://github.com/extjwnl/">ExtJWNL</a> for short. I used version 1.8, and here are the necessary dependency declarations for your POM:</p><pre><code>&lt;dependency&gt;
 &lt;groupId&gt;net.sf.extjwnl&lt;/groupId&gt;
 &lt;artifactId&gt;extjwnl&lt;/artifactId&gt;
 &lt;version&gt;1.8.0&lt;/version&gt;
&lt;/dependency&gt;</code></pre><pre><code>&lt;dependency&gt;
 &lt;groupId&gt;net.sf.extjwnl&lt;/groupId&gt;
 &lt;artifactId&gt;extjwnl-data-wn31&lt;/artifactId&gt;
 &lt;version&gt;1.2&lt;/version&gt;
&lt;/dependency&gt;</code></pre><p>The second dependency is for the WordNet data files, btw.</p><p>Here is what you need to do to start using it.</p><pre><code>import java.io.InputStream;
import java.util.List;
import net.sf.extjwnl.data.POS;
import net.sf.extjwnl.dictionary.Dictionary;</code></pre><pre><code>// declare your Dictionary variable</code></pre><pre><code>private Dictionary wnDict = null;</code></pre><pre><code>// initialize it in the constructor</code></pre><pre><code>String propsFile = ConfigurationManager.getConfiguration().getString("WordnetPropertiesFile");
InputStream stream = getClass().getResourceAsStream(propsFile);
wnDict = Dictionary.getInstance(stream);</code></pre><p>Then, from your methods that need to read the base forms (or lemmas) of words, you can query the Dictionary.</p><pre><code>// calling the WN dictionary</code></pre><pre><code>POS fullposobj = POS.getPOSForLabel(fullpos);
List&lt;String&gt; bf = (List&lt;String&gt;) wnDict.getMorphologicalProcessor().lookupAllBaseForms(fullposobj,lowercaseLemma);</code></pre><p>Here is how Sirocco is doing it in its <a href="https://github.com/datancoffee/sirocco/blob/master/src/main/java/sirocco/indexer/dictionaries/en/BaseFormsDictionary.java">BaseFormsDictionary</a> class.</p><p>Btw, if you are wondering about the &#8220;WordnetPropertiesFile&#8221; property in the snippet above, here it is.</p><pre><code>&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;jwnl_properties language="en"&gt;
    &lt;version publisher="Princeton" number="3.1" language="en"/&gt;
    &lt;dictionary class="net.sf.extjwnl.dictionary.FileBackedDictionary"&gt;
        &lt;param name="morphological_processor" value="net.sf.extjwnl.dictionary.morph.DefaultMorphologicalProcessor"&gt;
            &lt;param name="operations"&gt;
                &lt;param value="net.sf.extjwnl.dictionary.morph.LookupExceptionsOperation"/&gt;
                &lt;param value="net.sf.extjwnl.dictionary.morph.DetachSuffixesOperation"&gt;
                    &lt;param name="noun" value="|s=|ses=s|xes=x|zes=z|ches=ch|shes=sh|men=man|ies=y|"/&gt;
                    &lt;param name="verb" value="|s=|ies=y|es=e|es=|ed=e|ed=|ing=e|ing=|"/&gt;
                    &lt;param name="adjective" value="|er=|est=|er=e|est=e|"/&gt;
                    &lt;param name="operations"&gt;
                        &lt;param value="net.sf.extjwnl.dictionary.morph.LookupIndexWordOperation"/&gt;
                        &lt;param value="net.sf.extjwnl.dictionary.morph.LookupExceptionsOperation"/&gt;
                    &lt;/param&gt;
                &lt;/param&gt;
                &lt;param value="net.sf.extjwnl.dictionary.morph.TokenizerOperation"&gt;
                    &lt;param name="delimiters"&gt;
                        &lt;param value=" "/&gt;
                        &lt;param value="-"/&gt;
                    &lt;/param&gt;
                    &lt;param name="token_operations"&gt;
                        &lt;param value="net.sf.extjwnl.dictionary.morph.LookupIndexWordOperation"/&gt;
                        &lt;param value="net.sf.extjwnl.dictionary.morph.LookupExceptionsOperation"/&gt;
                        &lt;param value="net.sf.extjwnl.dictionary.morph.DetachSuffixesOperation"&gt;
                            &lt;param name="noun" value="|s=|ses=s|xes=x|zes=z|ches=ch|shes=sh|men=man|ies=y|"/&gt;
                            &lt;param name="verb" value="|s=|ies=y|es=e|es=|ed=e|ed=|ing=e|ing=|"/&gt;
                            &lt;param name="adjective" value="|er=|est=|er=e|est=e|"/&gt;
                            &lt;param name="operations"&gt;
                                &lt;param value="net.sf.extjwnl.dictionary.morph.LookupIndexWordOperation"/&gt;
                                &lt;param value="net.sf.extjwnl.dictionary.morph.LookupExceptionsOperation"/&gt;
                            &lt;/param&gt;
                        &lt;/param&gt;
                    &lt;/param&gt;
                &lt;/param&gt;
            &lt;/param&gt;
        &lt;/param&gt;
        &lt;param name="dictionary_element_factory"
               value="net.sf.extjwnl.princeton.data.PrincetonWN17FileDictionaryElementFactory"/&gt;
        &lt;param name="file_manager" value="net.sf.extjwnl.dictionary.file_manager.FileManagerImpl"&gt;
            &lt;param name="check_path" value="false"/&gt;
            &lt;param name="file_type" value="net.sf.extjwnl.princeton.file.PrincetonResourceDictionaryFile"/&gt;
            &lt;param name="dictionary_path" value="/net/sf/extjwnl/data/wordnet/wn31"/&gt;
        &lt;/param&gt;
    &lt;/dictionary&gt;
    &lt;resource class="net.sf.extjwnl.princeton.PrincetonResource"/&gt;
&lt;/jwnl_properties&gt;</code></pre><p>ExtJWNL supports multiple dictionary storage backends (Memory, Database, File Backed) and uses property files to parameterize them. I used, as you can see, the File Backed dictionary, and all the WordNet files are located in the extjwnl-data-wn31-1.2.jar that I deploy to my processing infrastructure when I run Sirocco in the cloud.</p><p>In my next blog post I will write a bit more about the parts of OpenNLP I use to do sentiment analysis on text.</p>]]></content:encoded></item><item><title><![CDATA[Modernizing Sirocco from C# and SharpNLP to Java and Apache OpenNLP]]></title><description><![CDATA[Tl;dr: Automatic conversion from C# to Java is possible!]]></description><link>https://www.datancoff.ee/p/modernizing-sirocco-from-c-and-sharpnlp-to-java-and-apache-opennlp</link><guid isPermaLink="false">https://www.datancoff.ee/p/modernizing-sirocco-from-c-and-sharpnlp-to-java-and-apache-opennlp</guid><dc:creator><![CDATA[Serhii Sokolenko]]></dc:creator><pubDate>Thu, 20 Apr 2017 19:10:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dhJG!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb008d0-9f51-45d9-a807-90edecf46ffd_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Tl;dr:</strong> Automatic conversion from C# to Java is possible!</p><p>When we set out to develop Cuesense in the late 2000s, my partners and I standardized on the Microsoft stack mostly for skills reasons. Our sentiment analysis piece was built using SharpNLP, the C# clone of OpenNLP. It worked fine for the time, but when I decided to put it back into action recently, I faced a predictable challenge: the SharpNLP project was mothballed and I had my doubts that I would find many people interested in maintaining the C# version. 
On the other hand, OpenNLP&#8217;s Java branch actually graduated to Top-Level Project status at Apache and showed promise of consistent maintenance.</p><p>Java was not new to me; in fact, I used to develop pretty complex distributed systems in it before the first Internet bubble (Telco CLEC OSSes, anyone?). Moreover, C# is not that different from Java and I was hoping I could find an automated converter that would help me at least with part of the rewrite exercise.</p><p>In addition to converting the C# code base to Java and replacing SharpNLP with the latest Apache OpenNLP version 1.6, I also needed to convert the analytics database that was implemented in MySQL, with a bunch of stored procedures doing trend calculations. I will focus this blog post on the C# conversion and will write more about the MySQL conversion in a later post.</p><p>I googled &#8220;C# to Java conversions&#8221; and came across this Stack Overflow <a href="http://stackoverflow.com/questions/7029904/looking-for-the-code-converter-which-converts-c-sharp-to-java">post</a>. It listed several options for automated converters:</p><ul><li><p><a href="https://github.com/twiglet/cs2j">CS2J</a></p></li><li><p>Grasshopper</p></li><li><p>CSharpJavaMerger</p></li><li><p>Tangible Software <a href="http://www.tangiblesoftwaresolutions.com/Product_Details/CSharp_to_Java_Converter_Details.html">C# to Java Converter</a></p></li></ul><p>Long story short, the CS2J tool did a pretty awesome job, although it did require installing Mono and running a Windows-native app on my Mac. Here are the steps I took to do the initial conversion of my code base:</p><p>1. Install <a href="http://www.mono-project.com/">Mono</a> to run CS2J</p><p>2. Install GitHub Desktop (because I could never remember the git commands, after polluting my memory with Subversion, Visual SourceSafe and a bunch of other source control systems)</p><p>3. 
Get CS2J from <a href="https://github.com/twiglet/cs2j">GitHub</a> or download from <a href="http://www.cs2j.com/">http://www.cs2j.com/</a></p><p>4. Install Homebrew to be able to install Ant and Maven</p><pre><code>/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"</code></pre><p>5. Run CS2J</p><pre><code>cd cs2j</code></pre><pre><code>mono CS2JTranslator/bin/cs2j.exe -net-templates-dir=./NetFramework/ -out-java-dir=&lt;project-dir&gt;/JavaOutput -app-dir=&lt;project-dir&gt;/Common -cs-dir=&lt;project-dir&gt;/Common</code></pre><pre><code>mono CS2JTranslator/bin/cs2j.exe -net-templates-dir=./NetFramework/ -out-java-dir=&lt;project-dir&gt;/JavaOutput -app-dir=&lt;project-dir&gt;/Servers -cs-dir=&lt;project-dir&gt;/Servers</code></pre><p>The JavaOutput directory was the directory where I wanted my Java code to end up, and the Common and Servers directories were the folders where I kept the C# code to be converted.</p><p>SUCCESS!</p><p>Now that I had the initial output of the converter, the fun was just starting, as there were hundreds of compile errors that I had to resolve. 
But more on this in a <a href="https://medium.com/@datancoffee/selecting-a-java-wordnet-api-for-lemma-lookups-7fae7a273f91">future blog post</a>.</p>]]></content:encoded></item><item><title><![CDATA[Sirocco released under Apache 2.0 license to github]]></title><description><![CDATA[It&#8217;s been a long time in making, but we are finally ready to release Sirocco &#8212; the opinion extraction library based on Robert Plutchik&#8217;s emotion framework &#8212; under the Apache 2.0 license to github.]]></description><link>https://www.datancoff.ee/p/sirocco-released-under-apache-2-0-license-to-github</link><guid isPermaLink="false">https://www.datancoff.ee/p/sirocco-released-under-apache-2-0-license-to-github</guid><dc:creator><![CDATA[Serhii Sokolenko]]></dc:creator><pubDate>Sun, 16 Apr 2017 19:14:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dhJG!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fb008d0-9f51-45d9-a807-90edecf46ffd_700x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It&#8217;s been a long time in the making, but we are finally ready to release <a href="https://github.com/datancoffee/sirocco">Sirocco</a>, the opinion extraction library based on <a href="https://en.wikipedia.org/wiki/Robert_Plutchik">Robert Plutchik</a>&#8217;s emotion framework, under the Apache 2.0 license on GitHub. Sirocco was developed during my time at <a href="http://cuesense.com">Cuesense</a> in the early 2010s, and over time I&#8217;ve been tinkering with it, most recently converting it from C# to Java, making it compatible with the latest Apache <a href="https://opennlp.apache.org/">OpenNLP</a> version, and making it run with Google <a href="http://cloud.google.com/dataflow/">Cloud Dataflow</a>. 
My Cuesense buddies and contributors (Alexey Shevchuk, Sergey Shevchook, and Roman Khnykin) have all agreed to release this code to the broader NLP and Machine Learning community. Thank you, guys, for your hard work on Cuesense and Sirocco, and may many exciting text analytics projects ensue!</p>]]></content:encoded></item></channel></rss>